ABSTRACT 



In this thesis we relate the performance of hashing algorithms to the notion of 
clustering, that is, the pile-up phenomenon that occurs because many keys may probe 
the table locations in the same sequence. We will say that a hashing technique exhibits 
k-ary clustering if the search for a key begins with k independent random probes and 
the subsequent sequence of probes is completely determined by the location of the k 
initial probes. Such techniques may be very bad; for instance, the average number of 
probes necessary for insertion may grow linearly with the table size. However, on the 
average (that is, if the permutations describing the method are randomly chosen), k-ary 
clustering techniques for k>1 are very good. In fact, the average performance is 
asymptotically equivalent to the performance of uniform probing, a method that exhibits 
no clustering and is known to be optimal in a certain sense. 

Perhaps the most famous among tertiary clustering techniques is double hashing, the 
method in which we probe the hash table along arithmetic progressions where the initial 
element and the increment of the progression are chosen randomly and independently 
depending only on the key K of the search. We prove that double hashing is also 
asymptotically equivalent to uniform probing for load factors $\alpha$ not exceeding a certain 
constant $\alpha_0 = .31\ldots$. Our proof method has a different flavor from those previously used 
in algorithmic analysis. We begin by showing that the tail of the hypergeometric 
distribution a fixed percent away from the mean is exponentially small. We use this 
result to prove that random subsets of the finite ring of integers modulo m of cardinality 
$\alpha m$ always have nearly the expected number of arithmetic progressions of length k, 
except with exponentially small probability. We then use this theorem to start up a 
process (called the extension process) of looking at snapshots of the table as it fills up 
with double hashing. Between steps of the extension process we can show that the 
effect of clustering is negligible, and that we therefore never depart too far from the 
truly random situation. 



NON-STANDARD NOTATIONS USED 



C(m,n) denotes the binomial coefficient that counts the number of ways to choose n things 
out of m without repetitions; it equals $m!/(n!\,(m-n)!)$. 

$C(m; n_1,n_2,\ldots,n_k)$, where $n_1+n_2+\cdots+n_k = m$, denotes the multinomial coefficient that equals 
$m!/(n_1!\,n_2!\cdots n_k!)$. 

For positive x, y, the notation $x = y(1\pm\varepsilon)$ means $x \in (y(1-\varepsilon),\, y/(1-\varepsilon))$ if $\varepsilon<1$, and $x < y(1+\varepsilon)$ otherwise. 
Further notation is as in [Knuth1] or as defined in the text. 
CHAPTER 1: 

INTRODUCTION 



In this chapter we introduce the basic notions of hashing and of algorithmic 
analysis. We define terminology and notation to be used throughout this 
thesis. Finally we present a summary of the results to be proved. 
1.1. Hashing Algorithms. 

Hashing algorithms are a certain type of search procedure. We assume that we are given a 
set of records, where each record R is uniquely identified by its key K. Besides K the 
record R contains some unspecified useful information in the field INFO, as depicted in 
Figure 1.1.1. 



We wish to organize our records in such a way that (1) we can quickly find the record 
having a given key K (if such a record exists), and (2) we can easily add additional records 
to our collection. Since all retrieval and update requests are specified exclusively in terms 
of the key of a record, we will ignore the INFO field in most of the discussion that follows. A 
straightforward way to implement this organization is to maintain our records in a table. A 
table entry is either empty, or it contains one of our records, in which case it is full. We can 
look for a record with a given key by exhaustively examining all entries of the table. 
Similarly, a new record can be inserted into the table by searching for an empty position. It 
is clear that, unless we are careful, the searches in question can become quite protracted for 
a large collection of records. 



The idea of hashing is that of using a transformation h on the key K which gives us a "good 
guess" as to where in the table the record containing our key K is located. Suppose our 
table has m entries or positions, numbered 0,1,...,m-1. Then h maps the universe of keys, 
which we assume very large, into the set {0,1,...,m-1}. We call h a hash function, and depict 
it as a mapping, as in Figure 1.1.1. 

If h(K) = s, then we will say that key K hashes to position s. Naturally, several keys may 
hash to the same position. Thus if we are trying to insert a new key K into the table, it may 
happen that entry h(K) of the table is already occupied by another key. In that event we 
need a mechanism for probing the rest of the table until an empty entry is found. We will 
speak of a probe that encounters a full entry as a collision, and we will call our mechanism a 
collision resolution strategy. (It may, of course, happen that we are trying to insert a new 
key into an already full table, in which case we have an overflow.) Upon a retrieval request 
for the same key, we follow the same probe path until the record containing the key is found. 

We will assume that our collision resolution strategy is such that every table position is 
examined exactly once before we return to the original location. The particular probe path 
we follow during a search may depend on the key K and the state of the table at that 
moment, as the examples of the next section will make clear. We will also assume that our 
hash function selects each of the table entries with equal probability. It is intuitively clear 
that we want our function h to "randomly scatter" the keys over the entire table as much as 
possible. We will elaborate on these probabilistic concepts in Section 1.3. For the moment 
the point we wish to make is that, once the "uniformity" of h has been assumed, the collision 
resolution strategy alone fully determines the behavior of the algorithm. Thus every hashing 
algorithm we consider naturally breaks up into two parts: (1) the construction of the hash 
function h mapping the universe of possible keys into the set {0,1,...,m-1} so that each set 
member is chosen with approximately equal probability, and (2) the formulation of an 
efficient collision resolution strategy. Since in this thesis we are only concerned with the 
analysis of the performance of hashing algorithms, we will completely ignore the problem of 
constructing good hash functions. Similarly, if we use any additional randomizing 
transformations (hash functions) in the collision resolution strategy, we will only need to 
know the probability distribution of the values of such transformations. We will not concern 
ourselves with how such mappings can be explicitly constructed, given a specific universe 
of keys. 



1.2. Open Address Versus Chained Hash Techniques. 

A hashing algorithm is an open addressing method if the probe path we follow for a given 
key K depends only on this key. Thus each key determines a permutation of {0,1,...,m-1} 
which indicates the sequence in which the table positions are to be examined. Let n denote 
the number of records currently in the table. Perhaps the two best known open addressing 
hash algorithms are linear probing and double hashing. We use the descriptions of these 
algorithms given in [Knuth2]: 

ALGORITHM L (Linear probing). This algorithm searches an m-node table, looking for a 
given key K. If K is not in the table and the table is not full, K is inserted. 

The nodes of the table are denoted by TABLE[i], for 0 ≤ i < m, and they are of two 
distinguishable types, empty and occupied. An occupied node contains a key, called 
KEY[i], and possibly other fields. An auxiliary variable n is used to keep track of how many 
nodes are occupied; this variable is considered to be part of the table, and it is increased by 
1 whenever a new key is inserted. 

This algorithm makes use of a hash function h(K), and it uses a linear probing sequence to 
address the table. 

L1. [Hash.] Set i <- h(K). (Now 0 ≤ i < m.) 

L2. [Compare.] If KEY[i] = K, the algorithm terminates successfully. Otherwise if TABLE[i] 
is empty, go to L4. 

L3. [Advance to next.] Set i <- i-1; if now i<0, set i <- i+m. Go back to step L2. 

L4. [Insert.] (The search was unsuccessful.) If n = m-1, the algorithm terminates with 
overflow. (This algorithm considers the table to be full when n = m-1, not when n = m.) 
Otherwise set n <- n+1, mark TABLE[i] occupied, and set KEY[i] <- K. 
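The steps of Algorithm L can be sketched in Python as follows. This is only an illustrative sketch: the class name `LinearProbingTable`, the method `search_or_insert`, the use of `None` to mark an empty node, and the returned probe count are assumptions made for the example and are not part of the algorithm as stated.

```python
class LinearProbingTable:
    """Illustrative sketch of Algorithm L (hypothetical names throughout)."""

    def __init__(self, m, h):
        self.m = m
        self.h = h                    # hash function: key -> 0..m-1
        self.keys = [None] * m        # None marks an empty node
        self.n = 0                    # number of occupied nodes

    def search_or_insert(self, key):
        """Return (index, probes, found); insert the key when it is absent."""
        i = self.h(key)                          # L1. Hash.
        probes = 1
        while self.keys[i] is not None:          # L2. Compare.
            if self.keys[i] == key:
                return i, probes, True
            i -= 1                               # L3. Advance to next.
            if i < 0:
                i += self.m
            probes += 1
        if self.n == self.m - 1:                 # L4. Insert.
            raise OverflowError("table is considered full at n = m-1")
        self.n += 1
        self.keys[i] = key
        return i, probes, False
```

Because at most m-1 nodes are ever occupied, the loop always reaches an empty node, which is why the algorithm treats n = m-1 as full.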

ALGORITHM D (Open addressing with double hashing). This algorithm is almost identical to 
Algorithm L, but it probes the table in a slightly different fashion by making use of two hash 
functions $h_1(K)$ and $h_2(K)$. As usual $h_1(K)$ produces a value between 0 and m-1, inclusive; 
but $h_2(K)$ must produce a value between 1 and m-1 that is relatively prime to m. (For 
example, if m is prime, $h_2(K)$ can be any value between 1 and m-1 inclusive; or if $m = 2^p$, 
$h_2(K)$ can be any odd value between 1 and $2^p-1$.) The probe sequences in this case are 
arithmetic progressions. 

D1. [First hash.] Set i <- $h_1(K)$. 

D2. [First probe.] If TABLE[i] is empty, go to D6. Otherwise if KEY[i] = K, the algorithm 
terminates successfully. 

D3. [Second hash.] Set c <- $h_2(K)$. 

D4. [Advance to next.] Set i <- i-c; if now i<0, set i <- i+m. 

D5. [Compare.] If TABLE[i] is empty, go to D6. Otherwise if KEY[i] = K, the algorithm 
terminates successfully. Otherwise go back to D4. 

D6. [Insert.] If n = m-1, the algorithm terminates with overflow. Otherwise set n <- n+1, 
mark TABLE[i] occupied, and set KEY[i] <- K. 
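A minimal Python sketch of Algorithm D may help to see how the decrement c enters the probe sequence. The names `DoubleHashingTable` and `search_or_insert`, the `None` marker for empty nodes, and the probe counter are assumptions of this sketch; the caller is assumed to supply an `h2` whose values are relatively prime to m.

```python
class DoubleHashingTable:
    """Illustrative sketch of Algorithm D (hypothetical names throughout)."""

    def __init__(self, m, h1, h2):
        self.m = m
        self.h1 = h1                  # key -> 0..m-1
        self.h2 = h2                  # key -> 1..m-1, relatively prime to m
        self.keys = [None] * m        # None marks an empty node
        self.n = 0

    def search_or_insert(self, key):
        """Return (index, probes, found); insert the key when it is absent."""
        i = self.h1(key)                         # D1. First hash.
        probes = 1
        if self.keys[i] is not None:             # D2. First probe.
            if self.keys[i] == key:
                return i, probes, True
            c = self.h2(key)                     # D3. Second hash.
            while True:
                i -= c                           # D4. Advance to next.
                if i < 0:
                    i += self.m
                probes += 1
                if self.keys[i] is None:         # D5. Compare.
                    break
                if self.keys[i] == key:
                    return i, probes, True
        if self.n == self.m - 1:                 # D6. Insert.
            raise OverflowError("table is considered full at n = m-1")
        self.n += 1
        self.keys[i] = key
        return i, probes, False
```

With m = 7, `h1 = lambda k: k % 7` and `h2 = lambda k: 1 + k % 6`, the keys 10, 17 and 24 all hash to position 3 but then follow different arithmetic progressions.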

We note that the main difference between these two algorithms is that in double hashing the 
decrement distance c can itself depend on the key K. As we will see later, this additional 
degree of freedom can have profound effects on the performance. 

In non-open addressing methods the probe path of a key K depends on the previous history 
of the table. This is usually accomplished by storing in each record an additional LINK field 
which can be a pointer to another entry of the table. The probe path of a key is then 
determined by hashing to a location s and following the links from that location. If we hash 
to an empty entry or we come upon a null link, we know that the record we are looking for is 
not in the table. We will call such hash algorithms chained. Among the simplest and most 
widely used chained techniques are the following two: 

ALGORITHM A (Bucket search). We assume that we have m list-heads HEAD[i], 0 ≤ i ≤ m-1, 
each pointing to a (possibly empty) list of records. Each record has an additional LINK field 
that can be a pointer to another record, or null. The lists of the algorithm are kept disjoint. If 
a new record has h(K) = s, then this record is added to the end of the list pointed to by 
HEAD[s]. We also assume that we have an operation x <= AVAIL that makes x point to a 
block of memory where the new record can be stored. 

A1. [Hash.] Set i <- h(K). 

A2. [Is there a list?] If HEAD[i] = null then let j <= AVAIL, set HEAD[i] <- j and go to A5, 
else set i <- HEAD[i]. 

A3. [Compare.] If K = KEY[i] the algorithm terminates successfully. 

A4. [Advance to next.] If LINK[i] ≠ null then set i <- LINK[i] and go back to A3, else let j 
<= AVAIL and set LINK[i] <- j. 

A5. [Insert.] Set KEY[j] <- K, and LINK[j] <- null. 
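As a rough illustration of Algorithm A, the sketch below replaces the explicit LINK fields and the AVAIL operation by ordinary Python lists, one per list-head. That substitution, the names `BucketTable` and `search_or_insert`, and the probe count it returns (one probe for an empty list, otherwise the number of list elements examined) are assumptions of the sketch.

```python
class BucketTable:
    """Illustrative sketch of Algorithm A using Python lists as the chains."""

    def __init__(self, m, h):
        self.h = h                               # key -> 0..m-1
        self.heads = [[] for _ in range(m)]      # one disjoint list per head

    def search_or_insert(self, key):
        """Return (probes, found); append the key to its list when absent."""
        chain = self.heads[self.h(key)]          # A1. Hash.
        if not chain:                            # A2. Is there a list?
            chain.append(key)                    # A5. Insert.
            return 1, False
        for probes, stored in enumerate(chain, start=1):
            if stored == key:                    # A3. Compare.
                return probes, True
        examined = len(chain)                    # A4 ran off the end of the list
        chain.append(key)                        # A5. Insert at the end.
        return examined, False
```

Since new records are appended at the end of a list, a later lookup of a key examines exactly the elements that were in front of it at insertion time, plus the key itself.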
ALGORITHM C (Chaining with coalescing lists). This algorithm searches an m-node table, 
looking for a given key K. If K is not in the table and the table is not full, K is inserted. 

The nodes of the table are denoted by TABLE[i], for 0 ≤ i ≤ m, and they are of two 
distinguishable types, empty and occupied. An occupied node contains a key field KEY[i], a 
link field LINK[i], and possibly other fields. 

C1. [Hash.] Set i <- h(K)+1. (Now 1 ≤ i ≤ m.) 

C2. [Is there a list?] If TABLE[i] is empty, go to C6. (Otherwise TABLE[i] is occupied; we 
will look at the list of occupied nodes which starts here.) 

C3. [Compare.] If K = KEY[i], the algorithm terminates successfully. 

C4. [Advance to next.] If LINK[i] ≠ null, set i <- LINK[i] and go back to step C3. 

C5. [Find empty node.] (The search was unsuccessful, and we want to find an empty 
position in the table.) Decrease r one or more times until finding a value such that TABLE[r] 
is empty. If r = 0, the algorithm terminates with overflow (there are no empty nodes left); 
otherwise set LINK[i] <- r, i <- r. 

C6. [Insert a new key.] Mark TABLE[i] as an occupied node, with KEY[i] <- K and LINK[i] 
<- null. 
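The following Python sketch illustrates how lists coalesce in Algorithm C. The text above does not say how the auxiliary variable r is initialized; the sketch assumes it starts at m+1 and that position 0 is never used for storage, so that r = 0 signals overflow. The names `CoalescedTable` and `search_or_insert` are likewise assumptions.

```python
class CoalescedTable:
    """Illustrative sketch of Algorithm C (hypothetical names throughout)."""

    def __init__(self, m, h):
        self.m = m
        self.h = h                        # key -> 0..m-1
        self.keys = [None] * (m + 1)      # positions 1..m hold keys; 0 stays empty
        self.link = [None] * (m + 1)
        self.r = m + 1                    # assumption: r starts just past the table

    def search_or_insert(self, key):
        """Return (index, found); insert the key when it is absent."""
        i = self.h(key) + 1                          # C1. Hash.
        if self.keys[i] is not None:                 # C2. Is there a list?
            while True:
                if self.keys[i] == key:              # C3. Compare.
                    return i, True
                if self.link[i] is None:             # C4. Advance to next.
                    break
                i = self.link[i]
            while True:                              # C5. Find empty node.
                self.r -= 1
                if self.r <= 0:
                    self.r = 0
                    raise OverflowError("no empty nodes left")
                if self.keys[self.r] is None:
                    break
            self.link[i] = self.r
            i = self.r
        self.keys[i] = key                           # C6. Insert a new key.
        self.link[i] = None
        return i, False
```

With m = 5 and `h = lambda k: k % 5`, inserting 0, 5 and 4 produces the single chain 1, 5, 4: the key 4 hashes to position 5, which already belongs to the list that started at position 1, so the two lists coalesce.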



Chained methods require more storage because of the LINK fields but, as the analyses in 
[Knuth2] show, they usually outperform open addressing techniques with respect to number 
of probes for given m and n. 



Note that all algorithms we have presented handle both lookup and insertion. 
1.3. Algorithmic Analysis. 

We are concerned with analyzing the performance of algorithms such as those presented in 
the previous section. A discussion of how the analysis of specific algorithms relates to 
computational complexity is given in [Knuth3]. We first have to define the cost measure by 
which we will evaluate the performance. The two usual cost measures are the space and 
time consumed by the algorithm. In the algorithms we consider, the space cost will be either 
fixed or trivially computable, except for one case which we analyze in detail. In order to 
make our time costs implementation independent we will use the number of probes made 
during a lookup as our basic cost function. This accounts, however, for only part of where 
the running time of a hashing algorithm is spent. The computation of the hash function(s) is 
another significant component. In comparing algorithms we cannot always factor this 
component out, as double hashing, for example, uses two hash function computations per 
search, vs. only one for linear probing. Having made this caveat we now strictly confine our 
attention to the number of probes made. 



With any hash function it can happen that all the keys we insert will select the same probe 
sequence. In this unfortunate situation all the algorithms of the previous section reduce to a 
linear search of the table. Thus the worst case of hashing methods is not very interesting. 
We will be concerned with performance on the average. Before we can make precise the 
notion of the average number of probes, we need to specify the probability distribution of 
the inputs to our algorithms. We assume that every one of the hash functions we use will 
select each of its allowed values with equal probability, independently of all the others. 
Thus for Algorithms L, A, and C we will assume that h(K) = s ($0 \le s \le m-1$) with probability 
1/m. For double hashing we will take m to be prime and then assume that $(h_1(K),h_2(K)) = (i,j)$ 
with probability $1/(m(m-1))$, for all (i,j) with $0 \le i \le m-1$, $1 \le j \le m-1$, $i \ne j$. 

We now specify what we mean by the number of probes a bit more carefully. Let us 
consider the insertion of a new record. For open addressing techniques we will include in 
our count the very last probe in which an empty position was discovered. The other probes 
correspond to comparisons between keys. To avoid monotony of language we will use the 
terms probe and comparison interchangeably, even though this is misleading when it comes 
to the last probe. For chained techniques we will count one probe if we discover that the 
list is empty, and otherwise we will just use the number of list elements examined (i.e., in the 
case of insertion, the length of the list). 
We clearly need to distinguish a successful from an unsuccessful search. We will measure 
the performance of a hashing algorithm by the following two quantities: 

DEFINITION 1.3.1. Given any hashing algorithm we define $C'_n$ to be the average number of 
probes made into the table when the (n+1)-st record is inserted (unsuccessful search). We 
include in this count the very last probe that discovered the empty position in an open 
addressing technique, and count one for discovering an empty list in the case of chained 
hashing. We assume all hash functions involved to choose each of their allowed values 
independently with equal probability. 

Similarly, $C_n$ will denote the average number of comparisons (or probes) made in a 
successful search using the algorithm, when the table contains n records. For $C_n$ we assume 
that we are equally likely to look up any record present in the table. 

In an open addressing technique it is clear that the number of comparisons required to look 
up a specific record is the same as the number of probes made when that record was 
inserted. This observation implies that 

$$C_n = \frac{1}{n} \sum_{i=0}^{n-1} C'_i.$$

Thus in open addressing $C_n$ is just an average of the $C'_i$. For this reason $C'_n$ will be the principal 
quantity we investigate for such algorithms. 
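The observation behind this relation can be checked on a small example. The sketch below, which assumes linear probing with hash values drawn from a seeded pseudo-random generator, records the number of probes made when each key is inserted and again when it is later looked up. The function names and the choice of integer keys 0, 1, ..., n-1 are illustrative assumptions.

```python
import random


def probes_to_reach(table, start, key):
    """Probe linearly from `start` until `key` or an empty slot is seen."""
    m = len(table)
    i, probes = start, 1
    while table[i] is not None and table[i] != key:
        i = (i - 1) % m
        probes += 1
    return i, probes


def insertion_and_lookup_costs(m, n, seed=0):
    """Insert keys 0..n-1 (n < m) and return (insert costs, lookup costs)."""
    rng = random.Random(seed)
    table = [None] * m
    start = {}                         # hash value drawn for each key
    insert_cost = []
    for key in range(n):
        start[key] = rng.randrange(m)
        i, probes = probes_to_reach(table, start[key], key)
        table[i] = key
        insert_cost.append(probes)
    lookup_cost = [probes_to_reach(table, start[key], key)[1] for key in range(n)]
    return insert_cost, lookup_cost
```

Since nothing is ever deleted, every position that was occupied when a key was inserted is still occupied at lookup time, so the two lists of costs agree entry by entry and the average lookup cost equals the average of the insertion costs.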

The quantities $C_n$, $C'_n$ naturally also depend on m, the table size. We will find that a 
convenient way to express the answers we seek is in terms of the load (or occupancy) 
factor $\alpha$ of the table, where $\alpha = n/m$. In several cases we will be unable to obtain $C_n$, $C'_n$ as 
closed form expressions in n and m. But in these cases we will still be able to obtain formulae 
for $C'_n$ and $C_n$ as functions of $\alpha$ (and possibly m) that are asymptotically valid. That is, as 
the table size m gets large, if the load factor $\alpha$, $0<\alpha<1$, stays fixed, these functions of $\alpha$ will 
differ from the true values by errors of order $O(1/m)$, which therefore rapidly 
decrease as m increases. In terms of the load factor we may write $C_\alpha$, $C'_\alpha$ rather than $C_n$, 
$C'_n$. In this "continuous" approximation the above relation between successful and 
unsuccessful searches for open addressing becomes 

$$C_\alpha = \frac{1}{\alpha} \int_0^\alpha C'_x \, dx.$$

We will have occasion to appreciate the power of this notation in the examples of Section 
1.5, as well as in the following two chapters. 



1.4. Clustering. 

Since we are interested in the performance of hashing algorithms, we might ask the following 
question: what is the probability that two keys will follow exactly the same probe path? We 
can expect that the higher this probability, the more will different keys interfere with each 
other, and therefore the worse the performance of our algorithm will be. This interference 
phenomenon we will generally refer to as clustering. For example, in linear probing the 
probability that two keys will follow the same probe path is identical to the probability that 
they will hash to the same location, which is 1/m. In double hashing this probability is easily 
seen to be $1/(m(m-1))$. Thus we expect double hashing to have smaller $C'_n$ (and $C_n$) than 
linear probing, as is indeed borne out by the analyses. 
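These two probabilities can be confirmed by brute force for a small prime m. The sketch below enumerates every probe path of linear probing and of double hashing and computes the chance that two independently chosen paths coincide; the function names are assumptions of the sketch, and the decrementing probe order follows Algorithms L and D.

```python
from collections import Counter
from fractions import Fraction


def linear_path(m, i):
    """Probe path of linear probing starting at position i."""
    return tuple((i - t) % m for t in range(m))


def double_path(m, i, c):
    """Probe path of double hashing starting at i with decrement c."""
    return tuple((i - t * c) % m for t in range(m))


def same_path_probability(paths):
    """Chance that two independent uniform draws from `paths` are equal."""
    counts = Counter(paths)
    total = len(paths)
    return sum(Fraction(c, total) ** 2 for c in counts.values())


def linear_paths(m):
    return [linear_path(m, i) for i in range(m)]


def double_paths(m):
    return [double_path(m, i, c) for i in range(m) for c in range(1, m)]
```

For m = 7 there are 7 distinct linear probing paths and 42 distinct double hashing paths, which gives the probabilities 1/7 and 1/42 respectively.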

Another way to appreciate the effect of clustering is by observing that (loosely speaking) 
configurations of occupied positions that have a relatively high $C'_n$ grow with a higher 
probability than configurations with a low $C'_n$. For example, in linear probing a long block of 
contiguous occupied positions gives us a large contribution to the total $C'_n$. During the next 
insertion the probability that such a block will grow by one is proportional to the length of 
the block. Thus long blocks grow into even longer ones with higher probability than short 
ones. This "pile-up" effect accounts for the rapid increase in $C'_n$ for linear probing as $\alpha \to 1$. 
Similarly, in double hashing the configurations that contribute greatly to the mean $C'_n$ are 
those that contain a large number of arithmetic progressions among the occupied positions. 
In general the probability that a given empty position will be filled during the current insertion 
is proportional to the number of arithmetic progressions coming from the occupied positions 
to that empty position. Here we have made the convention that we have m-1 arithmetic 
progressions of length 0, so as to properly account for the probability of hitting our position 
on the first probe. Thus in double hashing, sets of occupied entries with an excessive 
number of arithmetic progressions will tend to grow into sets with even more progressions. 



The connection between clustering and $C'_n$ leads us to introduce a new family of classes of 
hashing techniques, those that exhibit secondary, tertiary, and in general k-ary clustering 
([Knuth2]). A hashing technique is said to exhibit secondary clustering if the search into 
the table begins with one random probe, and then follows a fixed permutation which depends 
only on the location of this first probe. A hashing technique is said to exhibit tertiary 
clustering if it begins with two independently random probes into the table, and then probes 
the remaining table positions in a fixed permutation that can depend only on the locations of 
those first two probes. And in general a k-ary clustering technique begins the search in the 
table with k independent random probes and then continues along a permutation that 
depends on the locations of these first k probes only. (It is unfortunate that our terminology 
is somewhat inconsistent: secondary clustering is 1-ary clustering, tertiary is 2-ary; we have 
maintained the terms secondary and tertiary for historical reasons, while for the analyses in 
Chapter 2 the above meaning of k is the natural one.) Thus linear probing exhibits secondary 
clustering, whereas double hashing exhibits tertiary clustering. More formally, we can think 
of a secondary clustering technique as being specified by an m x (m-1) matrix, where we 
think of the rows of the matrix as indexed by {0,1,...,m-1}, and the row corresponding to i 
is a permutation of {0,1,...,m-1} - {i} which specifies the order in which the remaining table 
positions are to be probed. Thus for linear probing we have the matrix depicted by Figure 
1.4.1. 



Similarly, a tertiary clustering technique is defined by an (m(m-1)) x (m-2) matrix, where we 
think of the rows as indexed by (i,j), $0 \le i \ne j \le m-1$, and row (i,j) specifies in which order to 
probe the remaining m-2 table positions when we make our first probe at i and our second 
probe at j. Thus the matrix corresponding to double hashing (assuming that m is prime) is as 
shown in Figure 1.4.2, where the rows specify the arithmetic progressions to be followed in 
the search. 



It is convenient to introduce at this point an open addressing technique that exhibits "no 
clustering", namely uniform hashing (or probing). Uniform hashing has the property that after 
n keys have been inserted, all C(m,n) possible subsets of occupied positions are equally 
likely. To achieve this we first probe the table at $h_1(K)$, then at $h_2(K)$ where $h_2(K) \ne h_1(K)$, 
then at $h_3(K)$ where $h_3(K) \ne h_1(K), h_2(K)$, and so on. Here each $h_j$ is assumed to select each 
of its allowed values with equal probability, independently of all the others. This method is 
certainly of no practical interest, since we have to compute arbitrarily many independent 
hash functions. On the other hand it is of theoretical importance, since Ullman has proved 
that no other open addressing technique can have a smaller $C'_n$ for all n ([Ullman]). Thus 
the performance of uniform hashing, which we will compute in the next section, can be used 
as a benchmark against which to measure the success of other open addressing techniques. 

Figure 1.4.1. THE MATRIX FOR LINEAR PROBING (row i reads i-1, i-2, ..., i+1, taken modulo m) 

Figure 1.4.2. THE MATRIX FOR DOUBLE HASHING 

The notion of clustering can also help us understand why we wish to make our hash function 
h uniform, i.e., to make it equally likely to hash to any table entry. Suppose we are dealing 
with a technique with secondary clustering and let $p_i$ denote the probability of hashing to 
entry i, $0 \le i \le m-1$. Then the probability that two keys will follow the same probe path is 

$$\sum_{i=0}^{m-1} p_i^2,$$

which, since $\sum_{0 \le i < m} p_i = 1$, is clearly minimized by setting 
$p_0 = p_1 = \cdots = p_{m-1} = 1/m$. 
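One way to see the minimization claim, given here as a short worked step rather than as part of the original argument, is to expand the squared deviations of the $p_i$ from 1/m and use the fact that the $p_i$ sum to 1:

```latex
\sum_{i=0}^{m-1} \Bigl(p_i - \frac{1}{m}\Bigr)^2
  = \sum_{i=0}^{m-1} p_i^2 - \frac{2}{m}\sum_{i=0}^{m-1} p_i + \frac{1}{m}
  = \sum_{i=0}^{m-1} p_i^2 - \frac{1}{m},
\qquad\text{hence}\qquad
\sum_{i=0}^{m-1} p_i^2 = \frac{1}{m} + \sum_{i=0}^{m-1} \Bigl(p_i - \frac{1}{m}\Bigr)^2 \ge \frac{1}{m}.
```

Equality holds exactly when every $p_i$ equals 1/m, so the uniform choice gives the smallest possible probability, 1/m, that two keys share a probe path.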



1.5. Some Sample Analyses. 

For some simple hashing algorithms we can compute the average number of probes required 
for a successful or unsuccessful search directly, using only elementary combinatorial 
mathematics. In this section we will analyze uniform hashing and bucket search. We will 
also present without proofs the results of the analyses of linear probing and Algorithm C; 
detailed arguments can be found in [Knuth2]. These analyses will be a useful introduction 
to the techniques of the next two chapters. The results will also be useful for future 
reference. 

We start with uniform hashing. Let $p_{tr}$ denote the probability that we need r probes to insert 
the (t+1)-st element into the table. We have 

$$C'_t = \sum_{1 \le r \le t+1} r\, p_{tr}.$$

It is often easier to compute this average another way, namely 

$$C'_t = \sum_{1 \le r \le t+1} q_{tr},$$

where $q_{tr} = p_{tr} + p_{t,r+1} + \cdots$ is the probability that at least r probes are needed. It is easily seen 
that, since all probes are random and independent, 

$$q_{tr} = \frac{t(t-1)\cdots(t-r+2)}{m(m-1)\cdots(m-r+2)} = \frac{(m-r+1)!\,(m-t)!\,t!}{(m-t)!\,(t-r+1)!\,m!} = C(m-r+1,m-t)/C(m,t).$$

Hence 

$$C'_t = \sum_{1 \le r \le t+1} q_{tr} = \frac{1}{C(m,t)} \sum_{1 \le r \le t+1} C(m-r+1,m-t) = C(m+1,m-t+1)/C(m,t) = 1 + t/(m-t+1)$$

(by 1.2.6-(10) of [Knuth1]). Thus, 

$$C'_n = 1 + n/(m-n+1).$$



For $C_n$ we have 

$$C_n = \frac{1}{n} \sum_{0 \le t \le n-1} C'_t = \frac{m+1}{n} \sum_{0 \le t \le n-1} \frac{1}{m-t+1} = \frac{m+1}{n} \sum_{m-n+2 \le t \le m+1} \frac{1}{t} = \frac{m+1}{n} \left(H_{m+1} - H_{m-n+1}\right).$$



Here $H_i$ denotes the harmonic number $\sum_{1 \le k \le i} 1/k$. In terms of $\alpha$, we can write these 
results as 

$$C'_\alpha = 1/(1-\alpha) + O(1/m),$$

$$C_\alpha = (1/\alpha) \log (1/(1-\alpha)) + O(1/m),$$

where in the latter we have used the fact that 

$$H_i = \log i + \gamma + O(1/i),$$

with $\gamma$ being Euler's constant ([Knuth1]). We see that the load factor has led to simple 
expressions. The relation 

$$C_\alpha = \frac{1}{\alpha} \int_0^\alpha C'_x \, dx$$

is evidently true. 
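The closed forms for uniform hashing can be checked with exact rational arithmetic. The sketch below sums the tail probabilities over r from 1 to t+1 (at most t+1 probes can be needed when t positions are occupied) and compares the result with the formulas for the unsuccessful and successful search; all function names are assumptions of the sketch.

```python
from fractions import Fraction
from math import comb


def q(m, t, r):
    """Probability that at least r probes are needed with t of m slots full."""
    return Fraction(comb(m - r + 1, m - t), comb(m, t))


def unsuccessful(m, t):
    """Average probes to insert the (t+1)-st key, as a sum of tail probabilities."""
    return sum(q(m, t, r) for r in range(1, t + 2))


def successful(m, n):
    """Average probes of a successful search: the mean of the insertion costs."""
    return sum(unsuccessful(m, t) for t in range(n)) / n


def harmonic(i):
    """Harmonic number H_i as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, i + 1))
```

For example, with m = 20 the value `unsuccessful(20, t)` equals 1 + t/(21 - t) for every t below 20, and `successful(20, 12)` equals (21/12)(H_21 - H_9).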



Next we handle bucket search. We assume of course that all $m^n$ possible input hash 
sequences are equally likely. Let $p_{nk}$ denote the probability that a given list has length k. 
There are C(n,k) ways to choose the set of keys in our sequence that will hash to the given 
list, and $(m-1)^{n-k}$ ways to assign hash values to the other keys. Therefore 

$$p_{nk} = C(n,k)\,(m-1)^{n-k}/m^n.$$

If we introduce the generating function ([Knuth1]) 

$$P_n(z) = \sum_{k=0}^{n} p_{nk} z^k,$$

then 

$$P_n(z) = \sum_{k=0}^{n} C(n,k)\,\bigl((m-1)^{n-k}/m^n\bigr)\, z^k = (1 + (z-1)/m)^n$$

by the binomial theorem. Note that 

$$P'_n(1) = n/m, \qquad P''_n(1) = n(n-1)/m^2.$$



We now have 

$$C'_n = \sum_{k=0}^{n} (k + \delta_{k0})\, p_{nk} = P'_n(1) + P_n(0),$$

where the term $\delta_{k0}$ counts one probe for discovering an empty chain. Thus 

$$C'_n = n/m + (1 - 1/m)^n.$$

For $C_n$ consider the total number of probes to find all keys. A list of length k contributes 
C(k+1,2) to the total; hence 

$$C_n = (m/n) \sum_{k=0}^{n} C(k+1,2)\, p_{nk} = (m/n)\left(\tfrac{1}{2} P''_n(1) + P'_n(1)\right) = 1 + (n-1)/(2m).$$

Asymptotically then we have 

$$C'_\alpha = \alpha + e^{-\alpha} + O(1/m),$$

$$C_\alpha = 1 + \alpha/2 + O(1/m).$$

(Note that these results are valid also for $\alpha>1$, since n>m is possible in this method.) 
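The bucket search formulas can be verified in the same exact fashion, by summing directly over the distribution of list lengths instead of going through the generating function. The function names below are assumptions of the sketch.

```python
from fractions import Fraction
from math import comb


def p(m, n, k):
    """Probability that a given list has length k after n keys are hashed."""
    return Fraction(comb(n, k) * (m - 1) ** (n - k), m ** n)


def bucket_unsuccessful(m, n):
    """Average probes of an unsuccessful search; an empty list costs one probe."""
    return sum((k + (1 if k == 0 else 0)) * p(m, n, k) for k in range(n + 1))


def bucket_successful(m, n):
    """Average probes of a successful search; a list of length k contributes C(k+1,2)."""
    return Fraction(m, n) * sum(comb(k + 1, 2) * p(m, n, k) for k in range(n + 1))
```

With m = 7 and n = 10 (a load factor above 1), the two sums agree exactly with n/m + (1 - 1/m)^n and 1 + (n-1)/(2m).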

In Section 6.4 of [Knuth2] it is shown that Algorithm C leads to 

$$C'_\alpha = 1 + (e^{2\alpha} - 1 - 2\alpha)/4 + O(1/m),$$

$$C_\alpha = 1 + (e^{2\alpha} - 1 - 2\alpha)/(8\alpha) + \alpha/4 + O(1/m),$$

which is only slightly worse than the performance of bucket search. The analysis of linear 
probing is a considerably more difficult problem. Knuth shows that 

$$C'_\alpha = (1 + 1/(1-\alpha)^2)/2 + O(1/m),$$

$$C_\alpha = (1 + 1/(1-\alpha))/2 + O(1/m),$$

which is always worse than the performance of uniform probing, and much more so for load 
factors $\alpha$ near 1. The graphs in Figure 1.5.1, reproduced from [Knuth2], summarize these 
analyses. Algorithm S on these graphs refers to bucket search. We will have much more to 
say about the graphs for U2 and D in the following two chapters. 

It is interesting to note that in all these hash methods, if we fix the load factor $\alpha$, then the 
average number of probes is nearly independent of the table size m. Thus, loosely speaking, 
hashing allows us to retrieve records with a bounded number of probes, irrespective of the 
number of records we have. This is why it is so convenient to express the performance in 
terms of the load factor $\alpha$, which is independent of m. It is also one of the reasons why 
hashing algorithms enjoy such popularity in software practice. 
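To compare the methods numerically, the limiting formulas of this section (with the O(1/m) terms dropped) can be written as functions of the load factor alone. This is only a convenience sketch; the function names are assumptions, and each function returns the pair (unsuccessful, successful).

```python
from math import exp, log


def uniform_probing(a):
    """Limiting averages for uniform hashing, 0 < a < 1."""
    return 1 / (1 - a), log(1 / (1 - a)) / a


def linear_probing(a):
    """Limiting averages for linear probing, 0 < a < 1."""
    return (1 + 1 / (1 - a) ** 2) / 2, (1 + 1 / (1 - a)) / 2


def bucket_search(a):
    """Limiting averages for bucket search, a > 0."""
    return a + exp(-a), 1 + a / 2


def coalesced_chaining(a):
    """Limiting averages for Algorithm C, 0 < a <= 1."""
    excess = exp(2 * a) - 1 - 2 * a
    return 1 + excess / 4, 1 + excess / (8 * a) + a / 4
```

At a load factor of one half, for instance, an unsuccessful search costs about 2 probes under uniform hashing and 2.5 under linear probing, while both chained methods stay below 1.2.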
Figure 1.5.1. COMPARISON OF COLLISION RESOLUTION METHODS: 
LIMITING VALUES OF THE AVERAGE NUMBER OF PROBES AS M → ∞ 

1.6. Summary of the Results 

The purpose of this thesis is to investigate and illustrate techniques for the analysis of 
hashing algorithms. The traditional tools of algorithmic analysis are the tools of classical 
combinatorial enumeration: special numbers (e.g., binomial coefficients and Fibonacci numbers), 
recurrence relations, and generating functions. In Chapter 2 we apply these tools to a 
number of different hashing algorithms. In Section 2.1 we present an extension of Algorithm 
C that handles overflows with reasonable efficiency. In Section 2.2 we analyze a method 
combining open addressing and chained hashing; we prove that it is not as advantageous as 
it seems at first sight. Section 2.3 discusses k-ary clustering techniques for which $C'_n$ is as 
large as possible and in fact turns out to grow linearly with m. The method used in the 
analysis indicates the power of the falling factorial power notation. The oracular argument by 
which the extremality of these techniques is proven is also of interest, as one of the few 
examples available where a certain algorithm in a class can be proven extremal on the 
average. We can see from the results of that section that only when k grows as $\Omega(\log m)$ will 
we be able to maintain $C'_n$ bounded for the worst k-ary clustering techniques, as $m \to \infty$. 

On the other hand Section 2.4 shows that on the whole (i.e., averaged over all techniques) 
k-ary clustering techniques for k>1 are quite good. We prove that if the permutations 
described in the definition of k-ary clustering for k>1 are randomly chosen, then $C'_n$ is 
asymptotically $1/(1-\alpha)$, the same as for uniform probing, which exhibits no clustering. We 
also analyze "random" secondary clustering (k=1), in which case we find that $C'_n$ is 
asymptotically $1/(1-\alpha) - \alpha - \log(1-\alpha)$. Thus secondary clustering techniques on the average 
are worse than tertiary (since $\alpha + \log(1-\alpha) < 0$), although better than linear probing. 

In Chapter 3 we exclusively concern ourselves with the analysis of double hashing. It has 
long been known from simulations that double hashing behaves essentially identically to 
uniform probing. We prove that for $0<\alpha<\alpha_0$, where $\alpha_0$ is an absolute constant, $\alpha_0 \approx .319$, 
this is indeed the case: $C'_n$ for double hashing is $1/(1-\alpha) + o(1)$. This result is 
considerably deeper than any of the other analyses we have carried out and the techniques 
we use have a different flavor. We cannot appeal to recurrence relations or generating 
functions. Instead we use a probabilistic argument to prove that the configurations of $\alpha m$ 
occupied positions that double hashing gives rise to almost always have nearly the expected 
number of arithmetic progressions, and thus nearly the expected $C'_n$. 



All of the results presented in Chapters 2 and 3 of this thesis are new, with the exception of
the analysis of random secondary clustering that was previously done using a more
complicated method in exercise 6.4-44 of [Knuth2]. Perhaps the most significant
contribution is the method used in the analysis of double hashing, that allows one to prove
the asymptotic equivalence in the performance between an algorithm and a simplified model
of its behavior.



There are many open problems relating to the material of both chapters. Are there secondary
clustering techniques with $C'_n$ better than $1/(1-\alpha) - \alpha - \log(1-\alpha)$? If we let $h_2(K) = f(h_1(K))$
in double hashing, are there simple f (e.g., f(x) = m-x) for which the resulting algorithm is
asymptotically equivalent to random secondary clustering? (Simulations support this
contention for the f given in parentheses.) Are there any open addressing techniques with
$C'_n < 1/(1-\alpha)$ for almost all $\alpha$? Most of these questions correspond to open problems in
Section 6.4 of [Knuth2]. Perhaps we can prove that "almost all" k-ary clustering techniques
with k>1 have a $C'_n$ which is $(1\pm\epsilon)/(1-\alpha)$. For Chapter 3 the most obvious next step is to
extend the argument to work for all $\alpha$, $0<\alpha<1$. The argument also can be applied to a
modified double hashing algorithm, in which $h_2(K)$ is restricted to a linear segment of the
table of size $\lambda m$, for any fixed $\lambda$, $0<\lambda<1$. The number of probes in the modified algorithm
can be proven to be asymptotically equal to that of double hashing. This modified algorithm
allows us to handle tables of non-prime size. Perhaps we can prove this for $h_2(K)$ restricted
to any subset of size $\lambda m$. A number of purely number theoretic questions about arithmetic in
the finite field $Z_m$ of m elements also arise (m is prime). We make the following two
conjectures: (1) Let $\ell$ be fixed, $0<\ell<1/2$, $S = \{1,\ldots,m^{\ell}\}$, T any $m^{\ell}$-element subset of $Z_m$,
$ST = \{st \mid s \in S,\ t \in T\}$. Then as $m \to \infty$, there exists a small constant $\epsilon$ such that
$|ST| > m^{2\ell-\epsilon}$; (2) if $0<x<m^{1/2}$, then no set of x elements of $Z_m$ can have more than
$O(x^2/k)$ arithmetic progressions of length k among its members, for any k = 1,2,...,x. Settling
either of these conjectures would prove double hashing equivalent to uniform hashing for all $\alpha$.



The appendix at the end of the thesis illustrates a rather curious application of the 
exponential sums technique of analytic number theory to the problem of enumerating 
arithmetic progressions, which is treated by more combinatorial methods in Chapter 3.

"cogito ergo sum" = "I think, therefore I sum",
graffiti in CMU Science Hall restroom



CHAPTER 2: 

RECURRENCE METHODS 



In this chapter we present four different hashing analyses. These fall into 
the traditional paradigm of algorithmic analysis, that is they involve either
direct counting or the recurrence relation/generating function techniques
expounded in [Knuth1], [Knuth2]. In these algorithms we are fortunate to
have probability distributions varying so regularly with the number of 
records in the table, that the desired performance either satisfies a closed 
form recurrence or can be explicitly calculated. The first two sections 
analyze two algorithms that allow one to handle overflows from a hash 
table gracefully. These methods work best if the number of records 
inserted is below a certain bound, but also handle larger quantities of 
records with only a moderate degradation in the performance. The last 
two sections discuss secondary, tertiary, and in general k-ary clustering 
techniques. Section 2.3 shows that the worst k-ary clustering techniques 
have an average number of probes linear in the table size. No fixed 
number of independent probes into the table can ever be sufficient by 
itself to guarantee sublinear performance. The subsequent probes are of 
paramount importance. In Section 2.4 the average performance over an 
entire class of hashing methods is computed. Perhaps most interesting is
the result that a "random" k-ary clustering technique is asymptotically
equivalent to uniform probing for k>1.

2.1. Chained Scatter Searching with Overflow.



A common way of resolving collisions in hashing is to chain, or link, together all records 
whose keys hash to the same address. When a record is to be looked up, the search begins 
at the table entry where its key hashes. If this entry is empty, then the record is not in the
table and can be inserted right there. Each full entry in the table contains a link field which
can be a pointer to another entry, or null, indicating the end of a chain. If the original entry in
the table we came upon was full, and the record sitting there was unequal to our record,
then this link field indicates the next entry to try. We keep repeating this process until
either (1) we find the record we are looking for, or (2) we run off the end of the chain we
are following. In case (2), we know the record is not in the table. In order to insert it, we
first find an empty entry (or give up, if the table is full). Then we place the record there and 
plant a link to that entry in the last member of the chain we just exhausted. This completes 
the lookup process. 
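
The lookup-and-insert procedure just described can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not a transcription of Algorithm C: the class name `ChainedScatterTable`, the explicit hash-address argument `h`, and the policy of taking empty entries from the top of the table downward are all hypothetical choices made for illustration.

```python
class ChainedScatterTable:
    """Illustrative chained scatter table; all names here are hypothetical."""

    def __init__(self, m):
        self.keys = [None] * m   # None marks an empty entry
        self.link = [None] * m   # None marks the end of a chain
        self.free = m            # every entry at index >= free is known to be full

    def search_insert(self, key, h):
        """Look up `key`, whose hash address is `h`; insert it if absent.

        Returns (index, found): the entry holding the key, and whether the
        key was already present.
        """
        if self.keys[h] is None:              # empty entry: insert right there
            self.keys[h] = key
            return h, False
        i = h
        while True:                           # follow the chain from entry h
            if self.keys[i] == key:
                return i, True                # case (1): record found
            if self.link[i] is None:
                break                         # case (2): ran off the end of the chain
            i = self.link[i]
        r = self.free - 1                     # assumed policy: scan downward for an empty entry
        while r >= 0 and self.keys[r] is not None:
            r -= 1
        if r < 0:
            raise OverflowError("table is full")
        self.free = r
        self.keys[r] = key
        self.link[i] = r                      # plant a link in the last member of the chain
        return r, False


table = ChainedScatterTable(5)
for key, h in [("a", 2), ("b", 2), ("c", 2), ("d", 4)]:
    table.search_insert(key, h)
print(table.keys, table.link)
```

In this run key "d" hashes to entry 4, which is already held by "b" from the chain that started at entry 2, so "d" is appended to that same chain. Chains that started at different addresses can therefore share a tail.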



The above is an informal description of Algorithm C (Chained scatter table search and
insertion) in Section 1.2. This algorithm is analyzed in exercises 6.4-39 through 6.4-42 of [Knuth2].
The key results of the analysis are as follows: Let m be the total number of entries in the
table and let n be the number of records currently in it. Assume that all $m^n$ possible input
hash sequences (i.e., sequences of possible hash values for the keys of our n records, in
order of their insertion) are equally likely. Assume also that it is equally likely that we will
search for any of the records presently in the table. Then the average number of
comparisons in an unsuccessful search $C'_n$ is given by

$$C'_n = 1 + \bigl((1 + 2/m)^n - 1 - 2n/m\bigr)/4 = 1 + (e^{2\alpha} - 1 - 2\alpha)/4 + O(1/m), \tag{1}$$

where $\alpha = n/m$ = occupancy factor of the table. For a successful search, the average
number $C_n$ of comparisons is

$$C_n = 1 + (m/8n)\bigl((1 + 2/m)^n - 1 - 2n/m\bigr) + (n-1)/4m
     = 1 + (1/8\alpha)(e^{2\alpha} - 1 - 2\alpha) + \alpha/4 + O(1/m). \tag{2}$$
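
To get a feeling for how quickly the exact expressions (1) and (2) approach their limiting forms, the following sketch evaluates both side by side. The function names and the sample table size m = 1000 are assumptions chosen for illustration only.

```python
import math


def unsuccessful_exact(m, n):
    """Exact form of (1): average comparisons in an unsuccessful search."""
    return 1 + ((1 + 2 / m) ** n - 1 - 2 * n / m) / 4


def unsuccessful_limit(alpha):
    """Limiting form of (1) for occupancy factor alpha = n/m."""
    return 1 + (math.exp(2 * alpha) - 1 - 2 * alpha) / 4


def successful_exact(m, n):
    """Exact form of (2): average comparisons in a successful search."""
    return 1 + (m / (8 * n)) * ((1 + 2 / m) ** n - 1 - 2 * n / m) + (n - 1) / (4 * m)


def successful_limit(alpha):
    """Limiting form of (2) for occupancy factor alpha = n/m."""
    return 1 + (math.exp(2 * alpha) - 1 - 2 * alpha) / (8 * alpha) + alpha / 4


m = 1000  # sample table size (an arbitrary choice)
for n in (250, 500, 1000):
    alpha = n / m
    print(f"alpha={alpha:.2f}  "
          f"C'_n: {unsuccessful_exact(m, n):.4f} vs {unsuccessful_limit(alpha):.4f}  "
          f"C_n: {successful_exact(m, n):.4f} vs {successful_limit(alpha):.4f}")
```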



We proceed below to analyze a modification of Algorithm C that uses a table of size m' > m.
Only the first m locations are used for hashing, so the first m'-m overflow items will go into
the extra locations of the table. For fixed m', what choice of m in the range $1 \le m \le m'$ leads
to the best performance?

We will need a number of facts from the answer to exercise 39 of Section 6.4 of [Knuth2].
A key quantity in the analysis of Algorithm C is the function $c(k_1,k_2,\ldots,k_n)$, which denotes the
number of possible hash sequences of length n that give rise to $k_1$ chains of length 1, $k_2$
chains of length 2, etc., in the table. Note that $1k_1 + 2k_2 + 3k_3 + \cdots = n$. These numbers
satisfy a nice recurrence which can be used to show, among other things, that

$$\sum_{k_1+2k_2+\cdots = n}\ \sum_{j \ge 1} j^2 k_j\, c(k_1,k_2,\ldots) = \tfrac{1}{2}\bigl(m(m+2)^n - m^{n+1}\bigr). \tag{3}$$



To proceed, we observe that the new algorithm we propose, call it Algorithm C, behaves
exactly like C, until the first m locations of the table are filled. Let the vector $k = (k_1,k_2,k_3,\ldots)$
represent the chain size distribution when this occurs (i.e., $k_i$ = # of chains of length i,
$k_1+2k_2+3k_3+\cdots = m$). Suppose that subsequently p overflow items have been inserted
($p \le m'-m$). Each of these overflow entries will be part of a chain that originates in the
regular hash area. Let the vector $l = (l_1,l_2,\ldots,l_m)$ denote the distribution of these items among the
chains of various lengths, i.e., $l_i$ = # of overflow items that have been "hung" from a chain
that had length i when the regular part of the table was just filled up. Clearly, $l_1 + l_2 + l_3 + \cdots = p$.
The following important observation states that the vectors k and l are adequate
information for obtaining the average behavior of Algorithm C.

Observation. As far as the average number of comparisons in an unsuccessful search is
concerned, how the $l_i$ overflow items (that got hung from chains of length i) are distributed
among these $k_i$ chains is immaterial.



Proof: We assume, of course, that an unsuccessful search is equally likely to originate in
any of the m hash locations. Now if r overflow items get added to the tail of an i-chain, the
effect is simply to increase the number of comparisons for an unsuccessful search by r, for
each such search originating at one of the elements of the i-chain. Thus the total number of
comparisons for all possible unsuccessful searches originating in that chain is incremented
by ir. This is a linear function in r, and therefore, the total number of comparisons does not
depend on the exact distribution of overflow items among chains of the same length. ∎

(Thus we can assume, without loss of generality, that all $l_i$ overflow entries that have been
attached to chains of length i have, in fact, been attached only to one of them.)

Now a chain with t regular elements and s overflow elements attached to its tail contributes
to the total number of comparisons for all possible unsuccessful searches

$$(t+s) + (t-1+s) + \cdots + (1+s) = t(t+1)/2 + ts, \tag{4}$$

as illustrated in Figure 2.1.1.

As we noted, there are $c(k_1,k_2,\ldots)$ hash sequences of length m that give rise to a chain length
distribution $k = (k_1,k_2,\ldots)$. In how many ways can each such sequence be extended to one of
length m+p, that gives rise to an overflow item distribution $l = (l_1,l_2,\ldots)$?

Each of the $l_i$ items that go to chains of length i must hash to one of the $ik_i$ regular table
locations occupied by these chains. This, together with the fact that the order in which the
overflow items are inserted is unimportant (as far as l is concerned), shows that there are
exactly

$$C(p;\, l_1,l_2,\ldots,l_m) \prod_{i=1}^{m} (ik_i)^{l_i}$$

ways to make the extension demanded above.

Figure 2.1.1. A chain with regular and overflow elements.

Using (4), we can now compute the total number of comparisons for all unsuccessful
searches after p overflow items have been inserted, summed over all $m^{m+p}$ possible input
hash sequences. This number is

$$\sum_{\substack{l_1+l_2+\cdots = p \\ k_1+2k_2+\cdots = m}}\ \sum_{i \ge 1} \bigl((k_i-1)\,i(i+1)/2 + i(i+1)/2 + i\,l_i\bigr)\; C(p;\, l_1,l_2,\ldots) \prod_{j=1}^{m} (jk_j)^{l_j}\; c(k_1,k_2,\ldots), \tag{5}$$

where the terms $i(i+1)/2 + i\,l_i$ come from putting all $l_i$ overflow items on one of the $k_i$
i-chains.

We will use the following algebraic identity:

$$\sum_{i \ge 1}\ \sum_{l_1+l_2+\cdots = p} i\,l_i\; C(p;\, l_1,l_2,\ldots) \prod_{j=1}^{m} x_j^{l_j} = p\,(x_1+2x_2+\cdots+mx_m)(x_1+x_2+\cdots+x_m)^{p-1}.$$

(Hint: $l_i\,C(p;\, \ldots,l_i,\ldots) = p\,C(p-1;\, \ldots,l_i-1,\ldots)$.)

So (5) can be rewritten as

$$\sum_{i \ge 1}\ \sum_{k_1+2k_2+\cdots = m} \Bigl(\sum_{l_1+l_2+\cdots = p} C(p;\, l_1,l_2,\ldots) \prod_{j=1}^{m} (jk_j)^{l_j}\Bigr)\, k_i\, i(i+1)/2\; c(k_1,k_2,\ldots)
\;+ \sum_{k_1+2k_2+\cdots = m} p\,(1^2k_1+2^2k_2+\cdots+m^2k_m)\, m^{p-1}\, c(k_1,k_2,\ldots)$$

$$= m^p \sum_{i \ge 1}\ \sum_{k_1+2k_2+\cdots = m} k_i\, i(i+1)/2\; c(k_1,k_2,\ldots) \;+\; p\,m^{p-1} \sum_{i \ge 1}\ \sum_{k_1+2k_2+\cdots = m} k_i\, i^2\, c(k_1,k_2,\ldots).$$



It is now not difficult to see that the left sum is simply the total number of comparisons for an
unsuccessful search when the regular part of the table has been filled. Its value can be
conveniently computed from (1) (let n = m and multiply by $m^{m+1}$). Similarly, the right sum is
given to us by (3). To obtain the average number of comparisons for an unsuccessful
search in Algorithm C, we need to divide the above expression by $m^{m+p+1}$ (namely $m^{m+p}$ to
account for all hash sequences, times m to account for all initial hash values of our particular
search). Thus $C'_p$ is

$$C'_p = (1 + 2/m)^m/4 + 1/4 + p(1 + 2/m)^m/2m - p/2m. \tag{6}$$



Note that when p = 0, this formula agrees with (1) if we set n = m, as it should. When m = 1,
the above formula says that $C'_p = 1+p$, which is certainly true, for we then have a chain of
p+1 records linked together, with every unsuccessful search necessarily starting at the head
of the chain.
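
Formula (6) can also be checked by brute force on very small tables. The sketch below enumerates every hash sequence of length m+p, builds the chains, and averages the unsuccessful search cost over all m starting addresses, using exact rational arithmetic. It assumes, as in the analysis above, that overflow entries are used only once the regular area is full; the choice of which empty regular entry to take (here, the highest one) and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def build_links(hashes, m):
    """Insert records with the given hash addresses and return the link fields.

    Entries 0..m-1 form the regular area; overflow entries are appended
    after it, and only once the regular area is full.
    """
    full = [False] * m
    link = [None] * m
    for h in hashes:
        if not full[h]:
            full[h] = True
            continue
        i = h
        while link[i] is not None:        # walk to the end of the chain
            i = link[i]
        empties = [s for s in range(m) if not full[s]]
        if empties:
            new = empties[-1]             # assumed policy: highest empty regular entry
            full[new] = True
        else:
            new = len(link)               # next overflow entry
            link.append(None)
        link[i] = new
    return link


def average_unsuccessful(m, p):
    """Exact average cost of an unsuccessful search after m+p insertions."""
    total = 0
    for hashes in product(range(m), repeat=m + p):
        link = build_links(hashes, m)
        for h in range(m):                # every search starts in the regular area
            i, steps = h, 1
            while link[i] is not None:
                i, steps = link[i], steps + 1
            total += steps
    return Fraction(total, m ** (m + p) * m)


def formula_6(m, p):
    a = Fraction(m + 2, m) ** m
    return a / 4 + Fraction(1, 4) + p * a / (2 * m) - Fraction(p, 2 * m)


for m, p in [(2, 1), (3, 2)]:
    print(m, p, average_unsuccessful(m, p), formula_6(m, p))
```

The enumeration visits $m^{m+p}$ sequences, so it is only practical for tiny m and p; it is meant as a sanity check, not as a way to compute the average.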

We can also observe that as $m \to \infty$, $(1 + 2/m)^m$ tends to $e^2$. In fact, $(1 + 2/m)^m$ increases
monotonically to $e^2$, though rather slowly, the difference $e^2 - (1 + 2/m)^m$ being of the order
$m^{-1}$. Thus for m sufficiently large the above formula can be approximated by

$$C'_p \approx (e^2 + 1)/4 + p(e^2 - 1)/2m \approx 2.097 + 3.194\,(p/m).$$



This formula gives us a rough idea of the degradation of search efficiency with increasing 
size of the overflow area. In a moment, we will have more to say on this point. 



Let us also determine the average number of comparisons in a successful search by 
Algorithm C. We will assume that each of the m+p table entries is equally likely to be 
sought. The key observation is that a successful search for a record uses one more 
comparison than the unsuccessful search made when the record was first inserted. There is 
one exception to this, namely when the record originally hashed to an empty location. Then 
we count one comparison for both the successful and unsuccessful searches. Thus the 
desired average is given by

$$C_p = \frac{1}{m+p}\Bigl(m + p - \sum_{k=0}^{m-1} \frac{m-k}{m} + \sum_{k=0}^{m-1} C'_k + \sum_{k=m}^{m+p-1} C'_k\Bigr),$$

where $C'_k$ is taken from (1) for k < m and from (6), with k-m overflow items, for $k \ge m$, or

$$C_p = \frac{m\bigl(1 + ((1 + 2/m)^m - 3)/8\bigr) + m/4 - 1/4 + p + p\bigl((1 + 2/m)^m + 1\bigr)/4 + \bigl((1 + 2/m)^m - 1\bigr)\,p(p-1)/4m}{m+p}. \tag{7}$$

Note that when p = 0, this agrees with (2) if we set n = m. If m = 1, (7) gives (p+2)/2 which
certainly is the average number of comparisons in a successful linear search down a list of
length p+1.
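
The successful-search average can be checked on tiny tables in the same brute-force way. The sketch below assumes that the cost of finding a record equals the number of entries examined along its chain when it was inserted, counting the record itself, and that overflow entries are used only after the regular area is full. The closed form it is compared against is the one obtained by carrying out the sums described above with records inserted when k = 0, 1, ..., m+p-1 records are already present; that indexing, and all function names, are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import product


def total_successful(hashes, m):
    """Sum over all inserted records of the comparisons needed to find each one."""
    full = [False] * m
    link = [None] * m
    total = 0
    for h in hashes:
        if not full[h]:                   # hashed to an empty entry: one comparison
            full[h] = True
            total += 1
            continue
        i, steps = h, 1
        while link[i] is not None:        # walk to the end of the chain
            i, steps = link[i], steps + 1
        empties = [s for s in range(m) if not full[s]]
        if empties:
            new = empties[-1]             # assumed policy: highest empty regular entry
            full[new] = True
        else:
            new = len(link)               # next overflow entry
            link.append(None)
        link[i] = new
        total += steps + 1                # one more than the unsuccessful search
    return total


def average_successful(m, p):
    total = sum(total_successful(hashes, m)
                for hashes in product(range(m), repeat=m + p))
    return Fraction(total, m ** (m + p) * (m + p))


def closed_form(m, p):
    a = Fraction(m + 2, m) ** m
    regular = m * (1 + (a - 3) / 8) + Fraction(m - 1, 4)
    overflow = p + p * (a + 1) / 4 + (a - 1) * p * (p - 1) / (4 * m)
    return (regular + overflow) / (m + p)


for m, p in [(2, 1), (3, 2)]:
    print(m, p, average_successful(m, p), closed_form(m, p))
```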

If we use the fact that p+m = m', we can rewrite (6) as

$$C'_p = (m'/2m - 1/4)\bigl((1 + 2/m)^m - 1\bigr) + 1/2.$$

Keeping m' fixed, we can now ask for the m in the interval $1 \le m \le m'$ that minimizes $C'_p$. The
approximation to $e^2$ used above suggests that this occurs when m = m' (p/m = 0, so $C'_p \approx 2.1$).
It can be shown analytically that m' is a local minimum for m' sufficiently large. Briefly,
writing $m = m' - \Delta m$, we have

$$\bigl(m'/2(m'-\Delta m) - 1/4\bigr)\bigl((1 + 2/(m'-\Delta m))^{m'-\Delta m} - 1\bigr) + 1/2$$
$$= \bigl(1/2 + \Delta m/2m' - 1/4\bigr)\bigl((1+2/m')^{m'} - 2\Delta m/m'^2 - 1\bigr) + 1/2$$
$$= C'_0 + \Delta m\,(1+2/m')^{m'}/2m' - \Delta m/2m' + \text{higher order terms}.$$

The first-order term is positive for m'>1.
This shows that for m' large, a small decrement in m (from m') will increase $C'_p$.

A number of empirical tests were run on a computer for fifty values of m' between 10 and
1000. In all cases, the minimum occurred for m = m'.



The above results establish that, from a time-efficiency viewpoint, there is nothing to be 
gained over Algorithm C by adding the overflow area. 

Algorithm C, however, could still be useful in a situation where the number of entries to be
put in the table only rarely exceeds a certain bound. In this situation, the storage allocated
permanently to the table can equal the usual bound. Additional storage need not be committed
until really required, for Algorithm C uses the overflow area only after the regular part of the
table has been filled. And, as we saw in the above analysis, for any reasonable sized
overflow area, the performance of our algorithm suffers only moderate degradation.



2.2. A Combined Open Address and Chained Hash Search. 



In this section we offer an analysis of a hashing technique originally proposed by D. G. 
Bobrow. The algorithm is described in [Bobrow] as follows: 



"...the following nonhomogeneous algorithm seems to combine the 
best of both [open address and chained techniques]. In open 
address search, two keys are forbidden: 0 (say) which indicates an
empty slot, and -1 (say) which indicates a deleted slot. In both these 
cases the value associated with the key is meaningless. If now 
instead of one area we allocate two, the first a one-shot (no 
collision) open address hash table, and the other a linearly allocated 
chain linked area, then we can use the first for open addressing. 
However, if there is a collision at an address then the pair at that 
address is moved to the linear collision (overflow) table. Then the 
key -1 is used to indicate that the value in the open address table 
corresponding to this key is a pointer into the overflow area. 



Thus when the table is mostly empty it acts as an open address 
scheme, and when it is almost full it acts like a bucket search. 
Ordering can be used on the overflow lists to cut down search times 
for failure. The collision case obviously takes exactly one more 
probe than it would in the chain hash search [to discover it is the 
chain case] ..." 



Bobrow's technique uses a fixed hash table and an overflow area that is allocated 
dynamically. In its simplest form the technique maintains the overflow area as a set of 
disjoint linked lists, each list consisting of keys that have hashed to the same address of the 
fixed table. In case of a collision both colliding entries are moved to the overflow area and a
special key (say -1) is stored at the h(K)-th entry, indicating that the value field of that entry
is a pointer to the beginning of a list in the overflow area. Thus records are stored in the
fixed table until a collision occurs at their position, in which case they are moved to the
overflow. For convenience of description we shall term this method Algorithm B in what
follows.
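
The simplest form of the technique, as described above, might be sketched in Python like this. Everything concrete here is an assumption made for illustration: the class and method names, the use of two sentinel objects in place of the reserved keys, integer keys hashed by `key % m`, a Python list standing in for each linked overflow list, and the requirement that inserted keys be distinct.

```python
class AlgorithmB:
    """Illustrative sketch of the combined scheme; names and layout are hypothetical."""

    EMPTY = object()     # stands in for the reserved key marking an empty slot
    POINTER = object()   # stands in for the reserved key (-1) marking a list pointer

    def __init__(self, m):
        self.m = m
        self.key = [AlgorithmB.EMPTY] * m
        self.value = [None] * m

    def _address(self, key):
        return key % self.m              # assumed hash function for integer keys

    def insert(self, key, value):
        """Insert a key that is assumed not to be in the table already."""
        h = self._address(key)
        if self.key[h] is AlgorithmB.EMPTY:
            self.key[h], self.value[h] = key, value
        elif self.key[h] is AlgorithmB.POINTER:
            self.value[h].append((key, value))
        else:
            # First collision at h: move the resident pair and the new pair to overflow.
            self.value[h] = [(self.key[h], self.value[h]), (key, value)]
            self.key[h] = AlgorithmB.POINTER

    def lookup(self, key):
        """Return (value or None, number of key comparisons made)."""
        h = self._address(key)
        if self.key[h] is AlgorithmB.EMPTY:
            return None, 1
        if self.key[h] is not AlgorithmB.POINTER:
            return (self.value[h] if self.key[h] == key else None), 1
        comparisons = 1                  # the probe that discovers the special key
        for k, v in self.value[h]:
            comparisons += 1
            if k == key:
                return v, comparisons
        return None, comparisons

    def overflow_size(self):
        return sum(len(self.value[h]) for h in range(self.m)
                   if self.key[h] is AlgorithmB.POINTER)


b = AlgorithmB(5)
for key, value in [(3, "x"), (8, "y"), (1, "z")]:
    b.insert(key, value)
print(b.lookup(8), b.lookup(1), b.overflow_size())
```

Keys 3 and 8 collide at address 3, so both move to the overflow list and finding either one costs an extra comparison with the special key, while key 1 is still found in a single probe.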



In the analysis of Algorithm B the following quantities are important:

m = size of fixed table (number of entries),

n = number of records inserted so far,

$\alpha$ = n/m = the load factor of the table,

k = size of the key field of an entry (say in words),

v = size of the value field of an entry (clearly if v is big it is best to store
only a pointer to it in the table itself), and

l = size of the link field of an entry.

Algorithm B assumes that $v \ge l$ and uses m(k+v) words of fixed table and k+v+l words per
entry in the overflow area. One of the desirable properties of this algorithm is that no link
field need be allocated in the fixed table; thus if the overflow area is only rarely used,
Algorithm B offers us the storage economy of an open address technique, together with the
speedy search of a chained method. We will use the quantity

$\beta$ = l/(k+v+l)

to represent the average savings factor of our technique.



In the analyses below we assume that all $m^n$ possible sequences of the hash values of the n
inserted keys are equally likely.



Perhaps the most interesting of the computations is that of the average size of the overflow
area in Algorithm B. The number of entries in the overflow area equals the number of keys
inserted at whose hash address at least one collision has occurred. Let p(m,n,k) represent
the probability that exactly k non-special (i.e., $\ne$ 0, -1) keys are present in the fixed table. In
this event the overflow area has size n-k.



A given sequence of n keys to be inserted can be thought of as a mapping from a set of n
objects (the keys) to a set of m objects (the possible hash addresses). Any such mapping
defines a partition of its domain into disjoint classes, each class containing all keys hashing
to the same address. In this model we can interpret p(m,n,k) as the probability that a random
mapping has exactly k classes of size 1. Let $Q(m,n,k) = m^n p(m,n,k)$ = number of mappings
with the given property. Then, by separating out the k elements that form the singleton
classes, we obtain the recurrence

$$Q(m,n,k) = C(m,k)\, C(n,k)\, k!\, Q(m-k,n-k,0). \tag{1}$$



The numbers Q(m,n,0) count the number of mappings with no singleton classes. By
selecting one of the elements in the range of such a mapping we can get the recurrence

$$Q(m,n,0) = \sum_{0 \le t \le n} C(n,t)\, Q(m-1,t,0)\, \bigl(1 - \delta_{(n-t)1}\bigr). \tag{2}$$

Since we cannot explicitly solve (1) and (2) we now proceed to introduce the exponential
generating functions [Liu]

$$G_m(z) = \sum_{n \ge 0} Q(m,n,0)\, z^n/n!$$

and

$$G_{mk}(z) = \sum_{n \ge 0} Q(m,n,k)\, z^n/n!\,.$$

We have $G_1(z) = e^z - z$, because when m = 1 there exists exactly one mapping for all n,
except when n = 1, in which case our condition is violated.

Equation (2) translates to the recurrence for $G_m$

$$G_m(z) = G_{m-1}(z)\,(e^z - z), \quad m>1,$$

which coupled with the above remark implies that

$$G_m(z) = (e^z - z)^m.$$

From this and (1) we finally get the generating function we want

$$G_{mk}(z) = C(m,k)\, z^k\, (e^z - z)^{m-k}.$$



The average number of non-special keys present in the fixed table is

$$A_n = \sum_{0 \le k \le n} k\, p(m,n,k) = \frac{1}{m^n} \sum_{k \ge 0} k\, Q(m,n,k).$$

By interchanging the order of summation we obtain

$$\sum_{n \ge 0} m^n A_n\, z^n/n! = \sum_{k \ge 0} k\, C(m,k)\, z^k (e^z - z)^{m-k} = m z e^{(m-1)z}.$$

Therefore $A_n = n(1 - 1/m)^{n-1}$, and so the average size of the overflow area is

$$n\bigl(1 - (1 - 1/m)^{n-1}\bigr).$$

By similar manipulations we can compute the variance of the size of the overflow area to be

$$V_n = \sum_{0 \le k \le n} (A_n - k)^2\, p(m,n,k) = n(n-1)(1 - 1/m)(1 - 2/m)^{n-2} + n(1 - 1/m)^{n-1} - n^2 (1 - 1/m)^{2n-2},$$

which is rather small (the $n^2$ terms cancel).
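
Both moments are easy to confirm on small cases by enumerating all $m^n$ mappings and counting the classes of size 1 directly. The following sketch does so with exact rational arithmetic; the function names are hypothetical. Since the overflow area has size n-k, its variance is the same as that of the number k of singleton classes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def singleton_moments(m, n):
    """Exact mean and variance of the number of keys alone at their hash address."""
    first = second = 0
    for mapping in product(range(m), repeat=n):
        class_sizes = Counter(mapping).values()
        k = sum(1 for size in class_sizes if size == 1)
        first += k
        second += k * k
    total = m ** n
    mean = Fraction(first, total)
    return mean, Fraction(second, total) - mean ** 2


def mean_formula(m, n):
    return n * Fraction(m - 1, m) ** (n - 1)


def variance_formula(m, n):
    q1, q2 = Fraction(m - 1, m), Fraction(m - 2, m)
    return (n * (n - 1) * q1 * q2 ** (n - 2)
            + n * q1 ** (n - 1)
            - n ** 2 * q1 ** (2 * n - 2))


for m, n in [(2, 2), (3, 4), (4, 5)]:
    print(m, n, singleton_moments(m, n), (mean_formula(m, n), variance_formula(m, n)))
```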

The computation of $C_n$ and $C'_n$, the average number of comparisons done in a successful or
unsuccessful search respectively, proceeds along standard lines.

Let $A_t$ = the average number of comparisons needed to find the t-th key that was inserted. If
$p_t(r)$ denotes the probability that the list to which the t-th key hashes has r-1 entries in the
overflow area, then

$$p_t(r) = C(t-1,r-1)\,(1/m)^{r-1}\,((m-1)/m)^{t-r}$$

and

$$A_t = \sum_{r} (r+1)\,p_t(r) - p_t(1) = 2 + (t-1)/m - ((m-1)/m)^{t-1}.$$

Thus

$$C_n = \frac{1}{n} \sum_{1 \le t \le n} A_t = 2 + (n-1)/2m - (m/n)\bigl(1 - (1 - 1/m)^n\bigr).$$

Similarly for an unsuccessful search, let $p'_n(k)$ = probability that the list we search has k
entries in the overflow area. Then

$$p'_n(k) = C(n,k)\,(1/m)^k\,((m-1)/m)^{n-k}$$

and

$$C'_n = \sum_{k \ge 0} (k+1)\,p'_n(k) - p'_n(1) = 1 + (n/m)\bigl(1 - (1 - 1/m)^{n-1}\bigr).$$

In summary, the asymptotic performance of Algorithm B is given by

average number of comparisons for a successful search = $2 + \alpha/2 - (1 - e^{-\alpha})/\alpha$,

and

average number of comparisons for an unsuccessful search = $1 + \alpha(1 - e^{-\alpha})$,

where the given values are in fact approximations with an error of the order of O(1/m).

Our storage analysis has proved that

average number of entries in the overflow area of Algorithm B = $n\bigl(1 - (1 - 1/m)^{n-1}\bigr) = n(1 - e^{-\alpha}) + O(1)$,

and the distribution has a rather small variance, i.e. the variance equals $\bigl(e^{-\alpha} - (1-\alpha+\alpha^2)e^{-2\alpha}\bigr)n + O(1)$.

In order to interpret these numbers we compare Algorithm B with two other methods, a pure
bucket search, in which the fixed table consists only of list-heads, and Algorithm C of
Section 1.2 in which there is no overflow area and the lists are maintained in the fixed table
itself. The bucket search uses $ml$ locations for the listheads and n(k+v+l) for the entries,
whereas Algorithm C uses just a fixed table of size m(k+v+l).

The two graphs of Figure 2.2.1 compare the average time spent in a search by each of
these hash methods. Under the reasonable assumption that look-ups will be a lot more
frequent than insertions, we conclude that the time performance of Algorithm B is relatively
disappointing. It appears that at every load factor a successful search by Algorithm B will be
more expensive on the average than one by the other two methods. This is so because an
extra comparison with the special key (-1) will be involved in all successful searches except
those succeeding in the first probe. The presence of this special key will also complicate
the programming of the search loops in Algorithm B.
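
For the comparison with Algorithm C, the claim can be illustrated numerically from the asymptotic formulas already given: the successful-search cost of Algorithm B from this section and formula (2) of Section 2.1 for Algorithm C. The sketch below tabulates both over a grid of load factors. The function names and the grid are assumptions, and the bucket search is left out because no formula for it is stated here.

```python
import math


def successful_b(alpha):
    """Asymptotic successful-search cost of Algorithm B."""
    return 2 + alpha / 2 - (1 - math.exp(-alpha)) / alpha


def successful_c(alpha):
    """Asymptotic successful-search cost of Algorithm C, from (2) of Section 2.1."""
    return 1 + (math.exp(2 * alpha) - 1 - 2 * alpha) / (8 * alpha) + alpha / 4


rows = [(a / 10, successful_b(a / 10), successful_c(a / 10)) for a in range(1, 11)]
for alpha, cost_b, cost_c in rows:
    print(f"alpha={alpha:.1f}  B={cost_b:.3f}  C={cost_c:.3f}")
```

On this grid the cost for Algorithm B is larger at every load factor, consistent with the extra comparison against the special key.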

The table below indicates, for various savings factors $\beta$, the load factor $\alpha$ at which the
average storage requirements of Algorithm B exceed those of Algorithm C. (It can be shown 
that for m>500 these load factors a vary with m only within a tolerance of 10 ). 

    β       α

    .1     .344
    .3     .637
    .5     .864
    .7    1.00
    .9    1.00
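
The entries of this table can be reproduced approximately from the storage figures given earlier. Algorithm B uses m(k+v) words plus k+v+l words for each of the roughly $n(1-e^{-\alpha})$ overflow entries, while Algorithm C uses m(k+v+l) words, so the two are equal when $\alpha(1-e^{-\alpha}) = \beta$. The sketch below solves this equation by bisection. It relies on the asymptotic overflow size rather than the exact one, caps the answer at $\alpha = 1$ as the table does, and uses a hypothetical function name.

```python
import math


def crossover_load_factor(beta, tol=1e-9):
    """Load factor at which Algorithm B starts using more storage than Algorithm C.

    Solves alpha * (1 - exp(-alpha)) = beta on [0, 1]; returns 1.0 when the
    storage of B stays below that of C over the whole range.
    """
    def excess(alpha):
        return alpha * (1 - math.exp(-alpha)) - beta

    if excess(1.0) <= 0:
        return 1.0
    lo, hi = 0.0, 1.0                # excess is increasing on this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for beta in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"beta={beta:.1f}  alpha={crossover_load_factor(beta):.3f}")
```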



Figure 2.2.2 plots the storage requirements of these three techniques in the case k = v = l
(so $\beta$ = 1/3). (Table size m = 1000 was used, but the relative magnitudes of the graphs are
fairly independent of m.) The storage requirements of Algorithm B are sandwiched between
those of Algorithm A and Algorithm C, with a cross-over point at approximately $\alpha$ = .67. Thus
we can conclude that in the case k = v = l Bucket Search has a better running time and
simpler program than Algorithm B, at the cost of only slightly worse storage economy at high
load factors. Therefore Bucket Search is preferable, unless very special information
concerning the keys and the searches is given which allows us to conclude that for this
particular application Algorithm B is more desirable.



2.3. The Worst k-ary Clustering Techniques 

In this section we will discuss k-ary clustering techniques whose average number of probes
grows linearly with the table size. One obvious way to cause substantial clustering is to
force the probe paths of all keys to be identical after the first k random probes have failed.
The technique described below is a natural generalization of that presented in exercise
6.4-46 of [Knuth2].

The hashing technique uses k distinct independent random probes, first at $h_1(K)$, the second
at $h_2(K)$, where $h_2(K) \ne h_1(K)$, etc., up to a probe at $h_k(K)$. If we have collisions in all these
probes, then we make a left to right scan of the table for an empty slot, skipping over any
positions already examined.

As usual we let m = table size and n = number of entries in the table. The key quantity is

H(m,n,j) = number of possible hash sequences of length n
[more precisely, sequences of n k-tuples $(h_1(K),h_2(K),\ldots,h_k(K))$]
such that locations 1,2,...,j of the table get occupied and location
j+1 is empty.

By considering the remaining n-j entries we obtain

$$H(m,n,j) = C(m-j-1,n-j)\, F(m,n,j),$$

where F(m,n,j) denotes the number of sequences of n k-tuples $(h_1(K),h_2(K),\ldots,h_k(K))$ that
occupy positions 0,1,...,j-1,j+1,...,n (thus leaving position j empty). We observe that the
numbers F(m,n,j) satisfy the following recurrence based on n:

$$F(m,n+1,j) = \bigl(f(m,n) + n^{\underline{k}}\bigr) \sum_{0 \le l < j} F(m,n,l) + (n+1-j)\, f(m,n)\, F(m,n,j), \tag{*}$$

where

$$f(m,n) = \bigl(m^{\underline{k}} - n^{\underline{k}}\bigr)/(m-n) = n^{\underline{0}}(m-1)^{\underline{k-1}} + n^{\underline{1}}(m-2)^{\underline{k-2}} + \cdots + n^{\underline{k-1}}(m-k)^{\underline{0}}.$$

(This last identity is left as an exercise to the reader).
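
The identity for f(m,n) is easy to test numerically. One way to see why it should hold is that the terms $T_i = n^{\underline{i}}(m-i)^{\underline{k-i}}$ telescope from $T_0 = m^{\underline{k}}$ to $T_k = n^{\underline{k}}$, with $T_i - T_{i+1} = (m-n)\,n^{\underline{i}}(m-1-i)^{\underline{k-1-i}}$. The sketch below checks the identity in cross-multiplied form over a range of small integers; the function names are hypothetical.

```python
def falling(x, k):
    """Falling factorial power: x (x-1) ... (x-k+1)."""
    result = 1
    for i in range(k):
        result *= x - i
    return result


def f_sum(m, n, k):
    """Right-hand side: sum of n^(i falling) * (m-1-i)^((k-1-i) falling) for 0 <= i < k."""
    return sum(falling(n, i) * falling(m - 1 - i, k - 1 - i) for i in range(k))


def identity_holds(m, n, k):
    """Check (m-n) * f_sum = m^(k falling) - n^(k falling), avoiding division."""
    return (m - n) * f_sum(m, n, k) == falling(m, k) - falling(n, k)


checked = [(m, n, k)
           for m in range(2, 10)
           for n in range(0, m)
           for k in range(1, m + 1)]
print(all(identity_holds(m, n, k) for m, n, k in checked), len(checked))
```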

To see this we distinguish two cases:

Case 1: When n keys are present, position j is empty and all the other positions 0,1,...,j-1
occupied. Now note that F also counts the number of hash sequences that occupy positions
0,1,...,j-1, leave position j empty, and then occupy any fixed n-j positions among j+1,...,m-1.
Before the (n+1)-st key is inserted, we must have a situation in which positions 0,1,...,j-1 are
occupied, position j is empty, and among positions j+1,...,n+1 exactly one is empty. This
empty position has to be filled with the next insertion. As there are n-j+1 such candidate
positions, this can happen in $(n-j+1)(m-1)^{\underline{k-1}}$ ways on the first probe, or $(n-j+1)\,n\,(m-2)^{\underline{k-2}}$
ways on the second probe, ..., or $(n-j+1)\,n^{\underline{k-1}}$ on the last of the random probes. This covers
all possible ways, and so we get the term (n+1-j)f(m,n)F(m,n,j).

Case 2: When n keys are present, position l is empty, l < j, positions 0,1,...,l-1 are occupied,
position j is empty, and positions l+1,...,j-1,j+1,...,n+1 are occupied. The (n+1)-st insertion
has to fill position l. This can happen on the first probe in $(m-1)^{\underline{k-1}}$ ways, on the second
probe in $n(m-2)^{\underline{k-2}}$ ways, ..., on the k-th probe in $n^{\underline{k-1}}$ ways, and finally on succeeding
probes (i.e. during the linear scan from left to right) in $n^{\underline{k}}$ ways, for a total of $f(m,n)+n^{\underline{k}}$ ways.
Summing this contribution over all l we obtain the first term on the right hand side of (*).

The proper initial values for this recurrence are easily seen to be

$$F(m,1,0) = F(m,1,1) = (m-1)^{\underline{k-1}}.$$



If $C'_n$ denotes the average number of comparisons to insert a new element in the table when
n are already present, then clearly

$$C'_n = 1 + \sum_{0 \le j \le n} H(m,n,j)\, P(m,n,j) \big/ \bigl(m^{\underline{k}}\bigr)^n,$$

where

P(m,n,j) = average number of collisions when inserting a random
k-tuple $(h_1(K),h_2(K),\ldots,h_k(K))$ in a configuration of n
occupied positions of the type counted by H(m,n,j).

The quantity P(m,n,j) can be broken down according to the number of collisions that occurred
during the random probes:

$$P(m,n,j) = \sum_{0 \le i < k} i\, n^{\underline{i}}\,(m-n) \big/ m^{\underline{i+1}} + \bigl(n^{\underline{k}}/m^{\underline{k}}\bigr)\bigl(j + k(n-j)/n\bigr);$$

here j + k(n-j)/n is the average number of collisions after all k random probes have failed.

This sum can be computed in closed form by expressing the falling powers in terms of
factorials and then in terms of binomial coefficients and finally appealing to the well-known
identity 1.2.6-11 of [Knuth1]. Since the manipulations are similar to those illustrated below
for the more important sum of the H's, they will be omitted. The result is

$$P(m,n,j) = \frac{n\,m^{\underline{k}} - (n + k(m-n))\,n^{\underline{k}}}{m^{\underline{k}}\,(m-n+1)} + \bigl(n^{\underline{k}}/m^{\underline{k}}\bigr)\bigl(j + k(n-j)/n\bigr).$$

Therefore

$$C'_n = 1 + \frac{n\,m^{\underline{k}} - (n + k(m-n))\,n^{\underline{k}}}{m^{\underline{k}}\,(m-n+1)} + k\,n^{\underline{k}}/m^{\underline{k}} + \frac{(n-1)^{\underline{k}}}{\bigl(m^{\underline{k}}\bigr)^{n+1}} \sum_{0 \le j \le n} j\,H(m,n,j).$$

We have used above the fact that

$$\sum_{0 \le j \le n} H(m,n,j) = \bigl(m^{\underline{k}}\bigr)^n,$$

but it will be useful for our purposes to verify this directly. We will use the recurrence
relation (*) to prove this fact inductively.

We have for n=1

$$\sum_{0 \le j \le 1} H(m,1,j) = H(m,1,0) + H(m,1,1) = (m-1)F(m,1,0) + F(m,1,1) = m\,(m-1)^{\underline{k-1}} = m^{\underline{k}},$$

as desired.

Now

$$\begin{aligned}
\sum_j H(m,n+1,j) &= \sum_j C(m-j-1,n+1-j)\,F(m,n+1,j) \\
&= \sum_j \sum_{0 \le l < j} C(m-1-j,n+1-j)\bigl(f(m,n)+n^{\underline{k}}\bigr)F(m,n,l) + \sum_j C(m-1-j,n-j)(m-n-1)\,f(m,n)\,F(m,n,j) \\
&= \sum_{0 \le l \le n} F(m,n,l)\Bigl[\bigl(f(m,n)+n^{\underline{k}}\bigr)\sum_{l < j \le m} C(m-1-j,n+1-j) + (m-n-1)\,f(m,n)\,C(m-1-l,n-l)\Bigr] \\
&= \sum_{0 \le l \le n} F(m,n,l)\,C(m-1-l,n-l)\bigl[f(m,n) + n^{\underline{k}} + (m-n-1)f(m,n)\bigr] \\
&= m^{\underline{k}} \sum_{0 \le l \le n} F(m,n,l)\,C(m-1-l,n-l) = m^{\underline{k}} \sum_j H(m,n,j).
\end{aligned}$$

This completes the inductive proof.

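Next we evaluate $\sum_{0 \le j \le n} j\,H(m,n,j)$ in a similar manner.
We have

$$\begin{aligned}
\sum_j j\,H(m,n+1,j) &= \sum_j j\,C(m-j-1,n+1-j)\,F(m,n+1,j) \\
&= \sum_j \sum_{0 \le l < j} C(m-1-j,n+1-j)\bigl(f(m,n)+n^{\underline{k}}\bigr)\,j\,F(m,n,l) + \sum_j C(m-1-j,n-j)\,j\,(m-n-1)\,f(m,n)\,F(m,n,j) \\
&= \sum_{0 \le l \le n} F(m,n,l)\Bigl[\bigl(f(m,n)+n^{\underline{k}}\bigr)\sum_{l < j \le m} j\,C(m-1-j,n+1-j) + (m-n-1)\,f(m,n)\,C(m-1-l,n-l)\,l\Bigr] \\
&= \sum_{0 \le l \le n} F(m,n,l)\Bigl[C(m-1-l,n-l)\bigl(f(m,n)+n^{\underline{k}}\bigr)\bigl(m+(m-n-1)l\bigr)/(m-n) + (m-n-1)\,f(m,n)\,C(m-1-l,n-l)\,l\Bigr] \\
&= \Bigl[m^{\underline{k}} - \bigl(f(m,n)+n^{\underline{k}}\bigr)/(m-n)\Bigr] \sum_{0 \le l \le n} l\,H(m,n,l) + m\bigl(m^{\underline{k}}\bigr)^n\bigl(f(m,n)+n^{\underline{k}}\bigr)/(m-n).
\end{aligned}$$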



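Thus if we write

$$A_n = \sum_j j\,H(m,n,j) \big/ \bigl(m^{\underline{k}}\bigr)^n,$$

then we have shown that

$$A_{n+1} = \Bigl[1 - \frac{f(m,n)+n^{\underline{k}}}{m^{\underline{k}}(m-n)}\Bigr] A_n + \frac{m\bigl(f(m,n)+n^{\underline{k}}\bigr)}{m^{\underline{k}}(m-n)}.$$

It is easy to show that $A_1 = 1/m$.

It is perhaps best to write this recurrence as

$$A_{n+1}/m = \Bigl[1 - \frac{f(m,n)+n^{\underline{k}}}{m^{\underline{k}}(m-n)}\Bigr] A_n/m + \frac{f(m,n)+n^{\underline{k}}}{m^{\underline{k}}(m-n)},$$

from which it is easy to verify that

$$A_n = m - (m-1/m) \prod_{i=1}^{n-1} \Bigl[1 - \frac{f(m,i)+i^{\underline{k}}}{m^{\underline{k}}(m-i)}\Bigr]
     = m - (1+1/m)(m-n) \prod_{i=1}^{n-1} \Bigl[1 + \frac{1 - i^{\underline{k}}/m^{\underline{k}}}{m-i}\Bigr].$$

(We interpret the empty product to be 1).

Thus we finally have an explicit formula for $C'_n$, namely,

$$C'_n = 1 + \frac{n\,m^{\underline{k}} - (n + k(m-n))\,n^{\underline{k}}}{m^{\underline{k}}\,(m-n+1)} + k\,n^{\underline{k}}/m^{\underline{k}}
 + \frac{(n-1)^{\underline{k}}}{m^{\underline{k}}} \Bigl[m - (1+1/m)(m-n) \prod_{i=1}^{n-1} \Bigl[1 + \frac{1 - i^{\underline{k}}/m^{\underline{k}}}{m-i}\Bigr]\Bigr].$$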



Since this is a rather complicated expression, we proceed now to derive the asymptotic growth of $C'_n$ for $n = \alpha m$, $\alpha$ fixed, $m \to \infty$. Let us look at the asymptotics of the product. Taking logarithms we have, since $f(m,i)/m^{\underline{k}} = O(1/m)$,

\[
\begin{aligned}
-\sum_{1\le i<n} \log\Bigl(1 + \frac{1-i^{\underline{k}}/m^{\underline{k}}}{m-i}\Bigr)
&= -\sum_{1\le i<n} \frac{f(m,i)}{m^{\underline{k}}} + O(1/m) \\
&= -\sum_{1\le i<n}\,\sum_{0\le j<k} \frac{i^j}{m^{j+1}} + O(1/m)
 = -\sum_{0\le j<k}\,\sum_{1\le i<n} \frac{i^j}{m^{j+1}} + O(1/m) \\
&= -\sum_{0\le j<k} \frac{n^{j+1}}{(j+1)\,m^{j+1}} + O(1/m) \\
&= -\sum_{1\le j\le k} \frac{\alpha^j}{j} + O(1/m).
\end{aligned}
\]

The product itself is therefore $\exp\bigl(\sum_{1\le j\le k} \alpha^j/j\bigr)\,(1 + O(1/m))$. Substituting into the formula for $C'_n$ we see that

\[
C'_n = \alpha^k \Bigl(1 - (1-\alpha)\exp\Bigl(\sum_{1\le j\le k} \alpha^j/j\Bigr)\Bigr)\, m + O(1), \quad \text{as } m \to \infty.
\]

Thus we can see that $C'_n$ increases linearly with the table size m. The implied constant above can be made absolute if we restrict $\alpha$ to any interval $0<\alpha<\alpha_0$, for any absolute constant $\alpha_0 < 1$. Thus we have

THEOREM 2.3.1. A k-ary clustering technique that probes the table in a fixed sequence after the first k random probes have been exhausted (naturally ignoring any positions in this sequence already examined) has a performance that, as $m \to \infty$, $n = \alpha m$, $0<\alpha<1$, and $\alpha$, k fixed, is given by

\[
C'_n = \alpha^k \Bigl(1 - (1-\alpha)\exp\Bigl(\sum_{1\le j\le k} \alpha^j/j\Bigr)\Bigr)\, m + O(1).
\]

Proof. Clearly the above argument applies for any fixed permutation of probing the table entries after the initial k probes. ∎

From the above analysis we see that k has to get as high as Ω(log m) before we can hope to cause these worst k-ary clustering techniques to have a bounded number of comparisons on the average, as $m \to \infty$.
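The linear growth in m, and the need for k of order log m, can be observed numerically. The sketch below evaluates the explicit formula for $C'_n$ as it can be read from the damaged source, assuming that all powers are falling factorial powers; the first fraction of that formula in particular is a reading, not a verified transcription, and the function name is illustrative.

```python
from fractions import Fraction


def falling(x, k):
    """Falling factorial power x(x-1)...(x-k+1); equals 1 when k == 0."""
    result = 1
    for i in range(k):
        result *= x - i
    return result


def worst_case_probes(m, n, k):
    """Explicit C'_n for the fixed-sequence k-ary technique (assumed reading), 1 <= n < m-1."""
    mk = falling(m, k)
    prod = Fraction(1)
    for i in range(1, n):
        prod *= 1 + (1 - Fraction(falling(i, k), mk)) / (m - i)
    A = m - (1 + Fraction(1, m)) * (m - n) * prod
    first = Fraction(n * mk - (n + k * (m - n)) * falling(n, k), mk * (m - n - 1))
    return 1 + first + Fraction(k * falling(n, k), mk) + Fraction(falling(n - 1, k), mk) * A
```

With k = 1 and load factor 1/2, doubling m from 1000 to 2000 roughly doubles the value, whereas with k = 25 the value at m = 1000 stays close to 2, the figure for uniform probing at that load factor.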



We now prove that the above considered techniques are indeed the worst possible k-ary 
clustering techniques, that is they have the largest possible average number of probes per 
insertion. The oracular argument given below is interesting, as it is one of the few examples 
where we can prove certain algorithms extremal on the average. To prove this we will find it 
easier to prove a somewhat stronger result. 



THEOREM 2.3.2. For all fixed n, among all hashing techniques that begin the search into the table with k independent random probes and then continue the search in any manner whatsoever, no methods will exhibit a worse $C'_n$ than those which after the initial k probes follow a fixed permutation, skipping over any positions already examined.

Proof. By just renaming the table positions it is immediately clear that the performance of the method is independent of which specific permutation we use, and thus the "worst performance" mentioned above is well defined.

For the proof we shall think of a method under consideration as being defined by an oracle which, when the initial k random probes occur, decides, on the basis of the size of the table, the key being inserted, and even perhaps by consulting random variables, which of the remaining empty positions to fill.



Let us number the table entries 0,1,2,...,m-1 according to the sequence in which they are 
encountered during the final test search, i.e. during the insertion of the (n+1)-st key. Thus 
without loss of generality in the sequel we will be thinking of the final search as being a 
linear scan of the table from left to right. The number of comparisons made will then be the 
length of the leftmost contiguous group of occupied positions. 

The assertion of the theorem is equivalent to stating that the oracle O that always fills the 
leftmost available slot in the table causes the average length of the leftmost group of 
occupied positions to be at least as large as any other oracle. 



In each case the table gets filled by a sequence of moves which are of two kinds: either an R-move (random) in which an empty position is filled during the initial k random probes, or an O-move (oracle) in which an empty slot is filled by the oracle. Note that

(1) At each moment each empty slot is equally likely to be filled by an R-move.

(2) At each insertion the probability that it will be an R- vs. an O-insertion is independent of the oracle we are using.



We will now represent the sequence of insertions by a string over the integers among 1,...,m and the character "O". The integer i will indicate an R-move that fills the i-th available position from the left. The "O" will indicate an O-move. For example, with a certain oracle, the following insertion sequence S

    5 6 O1 3 O2 2 1 2

might fill the table as follows (the subscripts of the R's are the corresponding R-moves; those of the O's are just used to distinguish the different oracle moves):

    |R1|R2|R3|O1|R5|  |R6|O2|R2| ...

whereas our "worst" oracle O would produce the configuration

    |O1|O2|R1|R3|R5|R2|R6|  |R2| ...

The above remarks imply that, given any two oracles $O^1$ and $O^2$, the above way of representing insertions is a 1:1 correspondence of the ways of filling the table with the two oracles such that the corresponding ways are equiprobable.

The following inductive assertion will prove that our oracle O leads to an initial contiguous occupied group that is at least as long as that produced by any other oracle for each insertion specification sequence S, and thus prove the Theorem.

Assertion: For each n, for each insertion sequence S, and for each i, $1\le i\le m-n+1$, when the n-th insertion is made, the i-th empty position from the left will not be a lower numbered entry of the table if we were using our oracle O as compared with any other oracle.

The above example illustrates this principle. For the proof we induct on n. We can trivially check it for n=1. So all that remains is to verify that our assertion is preserved when the n-th key is inserted. Now if this new insertion is an R-move the effect is to simply take out a pair of corresponding positions in the two configurations, so the assertion clearly remains true. See Figure 2.3.1, part (a).

If the new insertion is an O-move, then again Figure 2.3.1, part (b) makes it obvious why the assertion is preserved.

Thus our proof of the theorem is complete, by applying the assertion for i=1. ∎
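The insertion-string representation used in the proof is easy to replay. The sketch below is an illustration only: the table size 10 and the scripted choices of the second oracle (slots 4 and 8, read off the example configuration) are assumptions, and the helper names are hypothetical.

```python
def replay(moves, m, oracle):
    """Replay an insertion string on a table with slots 1..m.

    An integer i is an R-move filling the i-th empty slot from the left;
    the string "O" is an O-move, whose slot is chosen by oracle(empty_slots).
    """
    occupied = set()
    for move in moves:
        empty = [s for s in range(1, m + 1) if s not in occupied]
        slot = oracle(empty) if move == "O" else empty[move - 1]
        occupied.add(slot)
    return occupied


def leftmost_run(occupied):
    """Length of the contiguous occupied group starting at slot 1."""
    length = 0
    while length + 1 in occupied:
        length += 1
    return length


def leftmost_oracle(empty):
    """The "worst" oracle: always fill the leftmost empty slot."""
    return empty[0]


def scripted_oracle(choices):
    """An oracle that fills a prescribed sequence of slots, one per O-move."""
    it = iter(choices)
    return lambda empty: next(it)


S = [5, 6, "O", 3, "O", 2, 1, 2]
worst = replay(S, 10, leftmost_oracle)
other = replay(S, 10, scripted_oracle([4, 8]))
```

Here `worst` reproduces the second configuration of the example and `other` the first. Comparing the sorted empty slots of the two tables position by position illustrates the Assertion: no empty slot under the leftmost oracle is lower numbered than its counterpart.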



[Figure 2.3.1. THE ORACLE ARGUMENT. Part (a): random insertion; part (b): oracle insertion. Each part compares the empty positions, numbered from left to right, under our oracle O and under any oracle, before and after the new insertion.]

2.4 The Analysis of Random k-ary Clustering Techniques

In this section we will determine the performance of a k-ary clustering technique, when the permutations describing the method are chosen at random. What we mean by this is that we consider all possible $(m-1)!^{m}$ selections of m permutations defining a secondary clustering matrix such as described in Section 1.4 as equally likely. Similarly, we have $(m-2)!^{m(m-1)}$ choices in the case of tertiary clustering, and so on. The analysis of random secondary clustering was first carried out correctly in exercise 6.4-44 of [Knuth2] by a considerably more complicated method. The results we get are interesting because they provide us with an idea of how other methods very similar to linear probing and double hashing might perform. We find, for example, that the random k-ary clustering technique with k>1 has

\[
C'_n = 1/(1-\alpha) + O(1/m), \quad n = \alpha m,\ m \to \infty.
\]

This is interesting from several viewpoints: 1) as we saw in Section 2.3, there are k-ary clustering techniques whose performance is very bad, that is for which $C'_n$ grows linearly with m; 2) as we remarked earlier, $1/(1-\alpha)$ is also $C'_n$ for uniform hashing, which implies, for example, that on the average tertiary clustering techniques are very good, since they perform just as well as uniform hashing which needs to compute an arbitrary number of independent hash functions, not just two. For secondary clustering we find that

\[
C'_n = 1/(1-\alpha) - \alpha - \log(1-\alpha) + O(1/m),
\]

which is significantly better than linear probing. This can be realized in practice by letting $h_2(K) = f(h_1(K))$ in the double hashing algorithm, where f is some easily computable function.
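To get a feeling for the sizes involved, the limiting values can be tabulated. The sketch below uses the two limits stated above; the linear probing value (1/2)(1 + 1/(1-α)^2) is the classical estimate for insertion cost and is quoted here as an outside assumption, since it is not derived in this section.

```python
import math


def uniform_probes(alpha):
    """Limiting C'_n for uniform probing (and random k-ary clustering, k > 1)."""
    return 1.0 / (1.0 - alpha)


def secondary_probes(alpha):
    """Limiting C'_n for random secondary clustering."""
    return 1.0 / (1.0 - alpha) - alpha - math.log(1.0 - alpha)


def linear_probes(alpha):
    """Classical insertion-cost estimate for linear probing (assumed, not derived here)."""
    return 0.5 * (1.0 + 1.0 / (1.0 - alpha) ** 2)


for alpha in (0.1, 0.5, 0.9):
    print(alpha, uniform_probes(alpha), secondary_probes(alpha), linear_probes(alpha))
```

At every load factor the three values are ordered uniform, secondary, linear, and the gap between secondary clustering and linear probing widens quickly as α approaches 1.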



We now begin the mathematical portion of this section where we perform the analyses described. The calculations are somewhat involved, so we will emphasize only the key steps. Let $p_{n,s}$ denote the probability that when the (n+1)-st key is inserted in the table s probes will be made. We let p = 1/m, q = 1-p, and make use of our falling-factorial power calculus. The following two identities are easily proved by induction or by techniques analogous to those used in Section 2.3:

\[
(11)\qquad \sum_{0\le t\le a} a^{\underline{t}}/b^{\underline{t+1}} = 1/(b-a), \quad\text{and}
\]

\[
(12)\qquad \sum_{0\le t\le a} t\,a^{\underline{t}}/b^{\underline{t+1}} = a/((b-a+1)(b-a)),
\]

where we interpret $x^{\underline{0}} = 1$.

[Figure 2.4.1. THE MATRIX DEFINING A SECONDARY CLUSTERING TECHNIQUE]

Let us now look at Figure 2.4.1, which illustrates the matrix defining a secondary clustering technique. The basic observation is that we need not know what an element of this matrix is, until we have a need to look at it. When the (n+1)-st key $K_{n+1}$ is inserted, the only row of the matrix that is of interest to us is row $h(K_{n+1})$, in which case a portion of this row has already been specified. Specifically assume that t-1 initial elements of this row have been specified. Since we want to compute the probability of making s probes, it follows that we will have to specify s-t additional elements of this row. The first s-t-1 must specify table positions that have already been occupied, whereas the last one must specify an empty position. We now break $p_{n,s}$ into a number of components, according to the greatest $i\le n$ for which $h(K_i) = h(K_{n+1})$. We further note that the probability of filling the remaining s-t positions of the row as described above when s>k (in this case k=1) is

\[
(m-n)\,(n-t)^{\underline{s-t-1}}/(m-t)^{\underline{s-t}},
\]

because for the t-th we have n-t choices of occupied positions to be selected among the m-t possibilities for continuing this row, for the (t+1)-st we have n-t-1 choices among the m-t-1, and so on till the last one. This last one, however, is exceptional because it must hit an empty position, so we have m-n choices out of m-s+1. As a result of all this we can write down a recurrence relation for the probabilities $p_{n,s}$. For the general k-ary clustering case we obtain the following:



\[
p_{n,1} = (m-n)/m, \qquad p_{n,2} = n(m-n)/(m(m-1)), \qquad \ldots, \qquad p_{n,k} = (m-n)\,n^{\underline{k-1}}/m^{\underline{k}},
\]

since the first k probes are independent and random, and for s>k we have



2n~l ^^ s~t~1 s~t 

pq Za Pj,t (n-t) /(m-t) — + 

1<i<n k<t<s 

+(1 -S Pnj -2 Pq""'(1 - 2 Pjj))(n-k)^^^^/(m-k)^^ ]. 
1<j<k 1<i<n 1<j<k 
The factor $p\,q^{n-i}$ is the probability that $h(K_i) = h(K_{n+1})$ and, for $i<j\le n$, $h(K_j) \ne h(K_{n+1})$. Similarly the multiplier of $(n-k)^{\underline{s-k-1}}/(m-k)^{\underline{s-k}}$ on the next line is the probability that $h(K_i) \ne h(K_{n+1})$ for $1\le i\le n$ and that the first k probes collide. If we now set

\[
\alpha'_n = \sum_{1\le s\le n+1} (s-1)\,p_{n,s} = C'_n - 1,
\]

we see that the above recurrence for $p_{n,s}$ can be transformed into one for $\alpha'_n$. In order to perform this manipulation we need the relation

\[
\sum_{t<s\le n+1} (s-1)\,(n-t)^{\underline{s-t-1}}/(m-t)^{\underline{s-t}} = m/((m-n)(m-n+1)) + (t-1)/(m-n+1),
\]

which is an easy consequence of (11) and (12). If we interchange summation on s and t we obtain a recurrence for $\alpha'_n$ in terms of $\alpha'_i$, i<n, which can be most easily stated if we use the substitution

\[
\alpha''_n = \alpha'_n - \Bigl((k-1) - \sum_{1\le j<k} (k-j)\,p_{n,j}\Bigr).
\]
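Identities (11) and (12), and the summation relation derived from them, can be confirmed in exact arithmetic for small parameters. In this sketch the sums are taken over 0 ≤ t ≤ a and over t < s ≤ n+1 respectively; the inclusive upper limits are a reading of the damaged source, chosen because they make the identities hold exactly. Function names are illustrative.

```python
from fractions import Fraction


def falling(x, k):
    """Falling factorial power x(x-1)...(x-k+1); equals 1 when k == 0."""
    result = 1
    for i in range(k):
        result *= x - i
    return result


def identity_11(a, b):
    """Left side of (11): sum over 0 <= t <= a of a^(t) / b^(t+1), for b > a."""
    return sum(Fraction(falling(a, t), falling(b, t + 1)) for t in range(a + 1))


def identity_12(a, b):
    """Left side of (12): sum over 0 <= t <= a of t * a^(t) / b^(t+1), for b > a."""
    return sum(t * Fraction(falling(a, t), falling(b, t + 1)) for t in range(a + 1))


def weighted_sum(m, n, t):
    """Sum over t < s <= n+1 of (s-1) (n-t)^(s-t-1) / (m-t)^(s-t)."""
    return sum(
        (s - 1) * Fraction(falling(n - t, s - t - 1), falling(m - t, s - t))
        for s in range(t + 1, n + 2)
    )
```

The relation follows by writing s-1 = (s-t-1) + t and applying (12) and (11) with a = n-t and b = m-t, which is exactly what `weighted_sum` adds up.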



This aforementioned recurrence is 



a 



«"o= 0- 



2n~i k k 

pq a": + (m-k+1)n-/(m-n+1)m-, 



I 
KKn 



If we now set

\[
\beta_n = (m-n+1)\,\alpha''_n/(m-n)
\]

we can subtract two versions of the above recurrence, one for n and one for n-1, to obtain a recurrence in which $\beta_n$ depends only on $\beta_{n-1}$:
jSn= (1 - 1/{m^(m-n+2)))j8n.i +



^(m.n) - (1 - 1/(m-(m-n+2)))nm,n-1). 



fio- 0. 



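where

\[
f(m,n) = (m-k+1)\,n^{\underline{k}}/((m-n)\,m^{\underline{k}}).
\]

This in turn can be simplified a great deal if we use the identity

\[
(m-k+1)\,n^{\underline{k}}/((m-n+1)\,m^{\underline{k}}) + \Bigl((k-1) - \sum_{1\le j<k} (k-j)\,p_{n,j}\Bigr) = n/(m-n+1),
\]

and define

\[
x_n = \beta_n - f(m,n) = (m-n+1)\,\bigl(\alpha'_n - n/(m-n+1)\bigr)/(m-n).
\]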

The final result is contained in the following theorem:

THEOREM 2.4.1. For random k-ary clustering we have

\[
C'_n = (m+1)/(m-n+1) + (m-n)\,x_n/(m-n+1),
\]

where

\[
x_n = \bigl(1 - 1/(m^{\underline{k}}(m-n+2))\bigr)\,x_{n-1} + (n-1)^{\underline{k}}/(m^{\underline{k}}\,m^{\underline{k-1}}\,(m-n+2)), \qquad x_0 = 0.
\]

Asymptotically we have, for k=1 (secondary clustering),

\[
C'_n = 1/(1-\alpha) - \alpha - \log(1-\alpha) + O(1/m),
\]

while for k-ary clustering with k>1 we have

\[
C'_n = 1/(1-\alpha) + O(1/m^{k-1}),
\]

as $m \to \infty$, $n = \alpha(m+1)$, fixed $\alpha < 1$.

Thus random tertiary and higher clustering techniques are asymptotically equivalent to uniform probing.



Proof. Note that (m+1)/(m-n+1) is the exact performance of uniform probing as analyzed in Section 1.5. Also note that $C'_n$ equals this value exactly until n exceeds k, for until then no clustering can occur. The argument preceding the Theorem has established the recurrence, so we only have to prove the asymptotic results. For k=1 we have

\[
x_n = \bigl(1 - 1/(m(m-n+2))\bigr)\,x_{n-1} - 1/m + (m+1)/(m(m-n+2)).
\]

If we let

\[
\psi_n = \psi_{n-1} - 1/m + (m+1)/(m(m-n+2)) \qquad (\psi_0 = 0),
\]

it is clear that $\psi_n = x_n + O(1/m)$ and that

\[
\psi_n = -n/m + \frac{m+1}{m}\,\bigl[H_{m+1} - H_{m-n+1}\bigr],
\]

where $H_j$ denotes the j-th harmonic number, from which the assertion of the Theorem for k=1 is clear.

For k>1 the argument that bounds $x_n$ is even more obvious, showing that this quantity only contributes to the O term. ∎
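The recurrence of Theorem 2.4.1 is simple to iterate exactly. The sketch below assumes the following reading of the damaged recurrence: the multiplier is 1 - 1/(m^{\underline{k}}(m-n+2)) and the additive term is (n-1)^{\underline{k}}/(m^{\underline{k}} m^{\underline{k-1}} (m-n+2)), with falling powers throughout. For k = 1 this agrees with the form used in the proof; for larger k it is an assumption. Function names are illustrative.

```python
import math
from fractions import Fraction


def falling(x, k):
    """Falling factorial power x(x-1)...(x-k+1); equals 1 when k == 0."""
    result = 1
    for i in range(k):
        result *= x - i
    return result


def x_value(m, n, k):
    """Iterate the recurrence for x_j from x_0 = 0 up to j = n (assumed reading)."""
    x = Fraction(0)
    mk = falling(m, k)
    for j in range(1, n + 1):
        decay = 1 - Fraction(1, mk * (m - j + 2))
        forcing = Fraction(falling(j - 1, k), mk * falling(m, k - 1) * (m - j + 2))
        x = decay * x + forcing
    return x


def expected_probes(m, n, k):
    """C'_n = (m+1)/(m-n+1) + (m-n) x_n / (m-n+1)."""
    return Fraction(m + 1, m - n + 1) + Fraction(m - n, m - n + 1) * x_value(m, n, k)


def secondary_limit(alpha):
    """Limiting value 1/(1-alpha) - alpha - log(1-alpha) for k = 1."""
    return 1.0 / (1.0 - alpha) - alpha - math.log(1.0 - alpha)
```

As the Proof notes, `expected_probes(m, n, k)` coincides with the uniform probing value (m+1)/(m-n+1) as long as n does not exceed k. For m = 1000, n = 500 and k = 1 the exact value lies within a few thousandths of `secondary_limit(0.5)`.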



For k = 1 and 2 (and possibly all k) we can obtain a "closed form" expression for $C'_n$ as a finite product. The manipulations are somewhat involved, but the final results are:

for secondary clustering (p = 1/m)

n-1 
C'n = 1 + m + (m-n)[p/(1-p)] - [(m+1-p)/{1-p)] U (1 - p/(m-i+1)), 

i=1 

and for tertiary clustering (p = 1/m(m-1)) 

C'n = 1 + m + (m-n)[p(3-p)/(1-p)(2-p) + pn/(2-p)m] 

n-1 
- [2(m+1-p)/(1-p){2-p)] n (1 - p/(m-i+1)) . 

i=1 

The above two product formulae are due to a suggestion of ([Paterson]).

"breed the best and forget the rest",

cattle ranch advertisement on California route 132



CHAPTER 3: 

THE ANALYSIS OF DOUBLE HASHING 



This entire chapter is devoted to the analysis of one algorithm, double 
hashing. The technique used is distinctly different from that of the
analyses of the introduction and the previous chapter. We can no longer 
obtain exact recurrence relations for the means. Instead, we prove our 
asymptotic limit by showing that configurations of occupied positions on 
the table that would exhibit an average number of probes very different 
from this limit have negligibly small probability. In spirit the method is 
mostly akin to the probabilistic method of Erdos ([Erd-Spen]). The one 
fact that makes the argument possible is that the probability of being at 
least a fixed percent away from the mean in a series of n Bernoulli trials is 
exponentially small in n. Since arithmetic progressions are the natural 
probe paths for double hashing, we begin this chapter with a study of the 
occurrence of arithmetic progressions in random subsets of specified 
cardinality that are composed of entries in our table. 
3.1 The Lattice of Arithmetic Progressions Coming From a Set to a Point

Let $Z_m$ denote the additive group of integers modulo m. We can think of these integers arranged in a circle, with 0 following m-1, as depicted in Figure 3.1.1.

In the entire context of this chapter m is a (sufficiently large) prime number. For any subset $H \subseteq Z_m$ we can count the number of arithmetic progressions that begin at 0 and whose next k elements lie in H. By an arithmetic progression we mean a sequence $x_0, x_1, \ldots, x_k$ such that $x_0 = 0$, $x_1, x_2, \ldots, x_k$ are elements of H, and $x_{i+1} - x_i$ (mod m) is the same for all i = 0,1,...,k-1. (The point 0 need not itself belong to H.) The primality of m guarantees that all the $x_i$ will be distinct. We will speak of such an arithmetic progression as a progression of length k coming from H to 0. We can generalize this concept if we allow that for each i, $1\le i\le k$, we specify whether the corresponding element of the progression is to be in H, or in the complement of H. Thus we arrive at the concept of a type of a progression. A type τ of length k can be thought of as a boolean vector of k bits. An arithmetic progression of length k coming to 0 is of type τ if the i-th element of the progression is in H or in the complement of H, according to whether the i-th bit of τ is a 1 or a 0. A 1 of the type will also be called a hit, whereas a 0 will be called a miss (for obvious reasons). We will display a type by writing down the corresponding bit vector, e.g., τ = (10110001). Any type τ has a length that will be usually denoted by k, and a number of hits, that will be usually denoted by l, $0\le l\le k$. Thus the above type has k = 8, l = 4. We will reserve the expression "a progression of length k coming from H to 0" to mean a progression of the type (11...1) with k hits. For any type τ and set H we can consider all the progressions of that type coming to 0. We will speak of these progressions as belonging to τ. For a fixed length k, the set of all types of that length forms a boolean lattice (or algebra), in the usual way. Figure 3.1.2 illustrates some of the ordering relationships.



The above lattice structure will not be important immediately, but will play a significant role 
in the latter half of this chapter. 



To fix the ideas let us now confine our attention to arithmetic progressions of length k belonging to the type of all hits. Clearly the number of progressions belonging to this type depends heavily on the set H. We can expect at most to make a probabilistic statement about the distribution of the number of these arithmetic progressions. We will be interested in such an estimation for large m with H of specified cardinality |H| = ηm, where 0<η<1. All subsets of this cardinality will be considered equally likely. As we let m get large, we will allow both k and η to vary with m. However, in order to make our argument work, we will see that we have to restrict the growth of k and/or the speed with which η can approach 0 or 1. In this and all subsequent sections, unless otherwise stated, our O and o notations will always refer to $m \to \infty$. The implied constants, unless otherwise stated, will be absolute.

[Figure 3.1.1. THE ADDITIVE GROUP $Z_m$]

[Figure 3.1.2. THE LATTICE STRUCTURE OF THE TYPES OF ARITHMETIC PROGRESSIONS]



What is the expected number of arithmetic progressions of length k coming from H to 0? There are m-1 arithmetic progressions in $Z_m$ coming to 0, one for each possible distance 1,2,...,m-1. Each such progression occurs in H with probability

\[
\frac{\eta m(\eta m-1)\cdots(\eta m-k+1)}{m(m-1)\cdots(m-k+1)} = (1+o(1))\,\eta^k
\]

if k is suitably small ($\log k < (\tfrac12-\epsilon)\log m$ will do, though in our applications we will always use a k that is O(log m)) and η is bounded away from 0. Thus the expected number of arithmetic progressions of length k coming from H to 0 is $(1+o(1))\,\eta^k m$. Let δ denote any small positive constant. In the following three sections we will prove that the fraction of choices of H for which the number of these arithmetic progressions is outside the range $\eta^k m(1\pm\delta)$ is exponentially small in m. By exponentially small we mean that there exist positive constants C, s such that this fraction is

exp[-C6%*'m/(k®log^(Ti'V))] 

provided $\eta^k m > m^{\mu}$, where μ denotes any positive constant.
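The expected count can be confirmed by brute force on a very small prime modulus. The sketch below enumerates all subsets H of a given size, counts the progressions of length k coming from H to 0, and compares the average with the exact expectation (m-1) times the ratio of falling powers. The parameters m = 7, |H| = 4 are chosen only to keep the enumeration tiny, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def count_progressions(H, m, k):
    """Number of distances d in 1..m-1 such that d, 2d, ..., kd (mod m) all lie in H."""
    H = set(H)
    return sum(all((i * d) % m in H for i in range(1, k + 1)) for d in range(1, m))


def average_count(m, size, k):
    """Average of count_progressions over all subsets H of Z_m with |H| = size."""
    subsets = list(combinations(range(m), size))
    total = sum(count_progressions(H, m, k) for H in subsets)
    return Fraction(total, len(subsets))


def expected_count(m, size, k):
    """(m-1) * size(size-1)...(size-k+1) / (m(m-1)...(m-k+1))."""
    num = den = 1
    for i in range(k):
        num *= size - i
        den *= m - i
    return (m - 1) * Fraction(num, den)
```

For m = 7, |H| = 4 and k = 2 both functions return 12/7. The concentration statement proved in the following sections is much stronger than this equality of means, and is not tested by such a small example.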



Our method is briefly the following. In Section 3.2 we consider the hypergeometric distribution, which arises when we compute the probability that two subsets of size αm, βm of a set of m elements have an intersection of the expected size αβm. We show that the probability of the intersection having a size outside the range $\alpha\beta m(1\pm\epsilon)$ is exponentially small in m. In Section 3.3 we use the Farey series to subdivide the circle formed by the reals (mod 1) into arcs, such that all arcs except those containing certain fixpoints have the property that any two among such an arc's first k multiples are disjoint. By the j-th multiple of an arc (interval) [x,y) we mean the arc [jx,jy) (mod 1). In Section 3.4 we use this subdivision, together with our estimation of the tail of the hypergeometric distribution, to prove the desired result.



The idea of Section 3.4 can be illustrated by an example. Suppose k = 2 and consider an arc [x,y) which is disjoint from the arc [2x,2y), as shown in Figure 3.1.3.

[Figure 3.1.3. AN ILLUSTRATION OF THE "PULL-BACK" ARGUMENT]

Now suppose we pick H, a random set of ηm points of the circle. What is the expected number of arithmetic progressions of length 2 coming from H to 0 whose first point lies in the set $J(x,y) = \{z \in Z_m \mid x \le z/m < y\}$? Instead of repeating the argument given earlier, we can proceed as follows. The interval J(x,y) has size (y-x)m. Consider the set $2J(x,y) = \{2z \mid z \in J(x,y)\} \subseteq J(2x,2y)$. This set also has cardinality (y-x)m, and is the locus of the second points of progressions whose first point is in J(x,y). We expect (y-x)ηm of the points of 2J(x,y) to be hit by H. The set of hit points can now be "pulled-back" to J(x,y) to give us the candidate first points of these progressions. Since J(x,y) and J(2x,2y) are disjoint, the expected number of points in this subset of J(x,y) that will be hit is $(y-x)\eta^2 m$. So $(y-x)\eta^2 m$ is the desired average. This argument illustrates how we can translate our knowledge of the probabilities for the size of set intersections to probabilities for the occurrence of arithmetic progressions in H.



All of the above remarks apply verbatim to types other than the type with k hits. If our type has l hits then we only need to replace $\eta^k$ by $\eta^l(1-\eta)^{k-l}$ everywhere in the above discussion, and restrict η away from both 0 and 1. Circular symmetry implies that our results also hold for any point of $Z_m$, not just 0. Our method solves the corresponding problem when we do not allow wrap-around or when we specify an upper bound on the number of times we can wrap around. We do not need this result, so we will not dwell on it any longer here. We will, however, need a slight generalization of the case k = 2 shown above. Given two points $x,y \in Z_m$, we will say that these points are in the ratio a:b if xb = ya (mod m). Given a fixed ratio a:b, $1\le a,b\le k$, we will want to estimate the number of pairs of points (x,y) of H that are in the ratio a:b.



3.2 The Tail of the Hypergeometric Distribution. 



In this section we estimate the tail of the hypergeometric distribution a specified percent away from the mean. Properties of the hypergeometric distribution are discussed, for example, in ([Feller]). Since we are interested in large deviations, the normal approximation will not be useful to us. Instead we will need an approximation more like the one done for the tail of the binomial distribution in ([Renyi]).

Suppose we have a sample space of size n, and we select from this space two subsets, one of size αn, the other of size βn (0<α,β<1). The probability that the cardinality of the intersection of the two subsets is k is

\[
a_k = C(\alpha n, k)\; C((1-\alpha)n, \beta n-k)\; /\; C(n, \beta n),
\]

where the first factor counts the ways to choose k of the βn's from the αn, the second the ways to choose the others from the rest, and the denominator is the total number of ways to choose βn of n.

The expected value of k is easily computed to be $\alpha\beta n(1+o(1))$. We will estimate the probability that k lies outside the range $\alpha\beta n(1\pm\epsilon)$. For this section only, our O notation will refer to $n \to \infty$.



THEOREM 3.2.1. Let $\gamma(n,\alpha,\beta,\epsilon)$ denote the probability that if we randomly select two sets of sizes αn and βn respectively out of a set of size n, their intersection will have cardinality outside the range $\alpha\beta n(1\pm\epsilon)$. Then as $n \to \infty$, provided

\[
0 < \alpha,\beta \quad\text{and}\quad \alpha(1+\epsilon),\ \beta(1+\epsilon) < 1,
\]

where α, β, ε can vary with n, we have

\[
\gamma(n,\alpha,\beta,\epsilon) < K(1+1/\epsilon)\,e^{-\varphi(\epsilon)\alpha\beta n},
\]

where

\[
\varphi(\epsilon) > (1+\epsilon)\log(1+\epsilon) - \epsilon + \tfrac12\epsilon^2\bigl[\alpha\beta/((1-\alpha)(1-\beta)) + \alpha/(1-\alpha) + \beta/(1-\beta)\bigr]
\]

and K is an absolute constant. The same conclusion holds if one of the two sets in question stays fixed.

Proof. We will first estimate the tail of the distribution above the mean. The tail below can be estimated in an essentially identical fashion.

We wish to estimate the sum

\[
\sum_{k>\alpha\beta n(1+\epsilon)} a_k = \sum_{k>\alpha\beta n(1+\epsilon)} C(\alpha n,k)\; C((1-\alpha)n, \beta n-k)\; /\; C(n,\beta n).
\]

Note that the ratio of two successive terms in this sum is

\[
a_{k+1}/a_k < (\alpha n-k)(\beta n-k)\,/\,\bigl(k\,((1-\alpha-\beta)n+k)\bigr),
\]

which is a decreasing function of k in the range of interest, i.e., $\alpha\beta n < k < \alpha n, \beta n$.
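The density $a_k$ and the bound on the ratio of successive terms can be examined exactly for small cases. In the sketch below the subset sizes are passed directly as integers a = αn and b = βn; the function names and the sample sizes are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def intersection_pmf(n, a, b):
    """Exact distribution of |A ∩ B| for random subsets with |A| = a, |B| = b of an n-set."""
    total = comb(n, b)
    return {
        k: Fraction(comb(a, k) * comb(n - a, b - k), total)
        for k in range(max(0, a + b - n), min(a, b) + 1)
    }


def upper_tail(n, a, b, eps):
    """Probability that the intersection exceeds (a*b/n) * (1 + eps)."""
    threshold = Fraction(a * b, n) * (1 + Fraction(eps))
    return sum(p for k, p in intersection_pmf(n, a, b).items() if k > threshold)
```

With n = 20, a = 10, b = 8 the mean is exactly 4 and `upper_tail(20, 10, 8, Fraction(1, 2))` equals 1245/125970, just under one percent. Doubling all three sizes while keeping the same ε makes the tail several times smaller, in line with the exponential decay asserted by the theorem.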

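For $k = \alpha\beta n(1+\epsilon)$ this ratio is less than

\[
\rho = \frac{(\alpha-\alpha\beta-\alpha\beta\epsilon)\,(\beta-\alpha\beta-\alpha\beta\epsilon)}{(\alpha\beta+\alpha\beta\epsilon)\,(1-\alpha-\beta+\alpha\beta+\alpha\beta\epsilon)}.
\]

It easily follows that ρ < 1. Therefore our sum is majorized by a convergent geometric series of ratio ρ, and we get a bound of

\[
[1/(1-\rho)]\; C[\alpha n, \alpha\beta n(1+\epsilon)]\; C[(1-\alpha)n, (\beta-\alpha\beta(1+\epsilon))n]\; /\; C(n,\beta n).
\]

Since, as we can easily check,

\[
1/(1-\rho) = 1 + (1-\alpha-\alpha\epsilon)(1-\beta-\beta\epsilon)/\epsilon < 1 + 1/\epsilon,
\]

we are only left with estimating the density of the hypergeometric distribution at $k = \alpha\beta n(1+\epsilon)$, as given above.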

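We will use Stirling's approximation for the factorial:

\[
\log n! = n\log n - n + \tfrac12\log n + \tfrac12\log 2\pi + O(1/n).
\]

From this we can easily derive the following fact:

\[
\log C((x+y)n, xn) = (n+\tfrac12)\bigl((x+y)\log(x+y) - x\log x - y\log y\bigr) - \tfrac12\log n + O(1).
\]

We can now apply this fact to the binomial coefficients we have and obtain after simplification

\[
\begin{aligned}
\log\bigl[\,C(\alpha n, \alpha\beta n(1+\epsilon))\; & C((1-\alpha)n, (\beta-\alpha\beta(1+\epsilon))n)\; /\; C(n,\beta n)\,\bigr] = \\
-\Bigl[\, & \alpha\beta(1+\epsilon)\log(1+\epsilon) \\
& + \alpha(1-\beta)\Bigl\{1-\frac{\beta\epsilon}{1-\beta}\Bigr\}\log\Bigl\{1-\frac{\beta\epsilon}{1-\beta}\Bigr\} \\
& + \beta(1-\alpha)\Bigl\{1-\frac{\alpha\epsilon}{1-\alpha}\Bigr\}\log\Bigl\{1-\frac{\alpha\epsilon}{1-\alpha}\Bigr\} \\
& + (1-\alpha)(1-\beta)\Bigl\{1+\frac{\alpha\beta\epsilon}{(1-\alpha)(1-\beta)}\Bigr\}\log\Bigl\{1+\frac{\alpha\beta\epsilon}{(1-\alpha)(1-\beta)}\Bigr\}\Bigr]\, n + O(1).
\end{aligned}
\]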

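The following two inequalities are elementary:

\[
(1+x)\log(1+x) > x \quad\text{for } x > 0,
\]
\[
(1-x)\log(1-x) > -x + x^2/2 \quad\text{for } 0<x<1.
\]

Since $\alpha(1+\epsilon)<1$ is equivalent to $\alpha\epsilon/(1-\alpha)<1$, and similarly for β, we can apply these inequalities to the above expression to obtain the upper bound

\[
\begin{aligned}
-\bigl[\, & \alpha\beta\epsilon + \alpha\beta[(1+\epsilon)\log(1+\epsilon) - \epsilon] \\
& - \alpha\beta\epsilon + \alpha\beta^2\epsilon^2/2(1-\beta) \\
& - \alpha\beta\epsilon + \alpha^2\beta\epsilon^2/2(1-\alpha) \\
& + \alpha\beta\epsilon\,\bigr]\, n,
\end{aligned}
\]

from which the conclusion of the theorem is immediate.

For the lower tail a similar argument gives an upper bound of

\[
-\bigl[\alpha\beta[(1-\epsilon)\log(1-\epsilon)+\epsilon] + \tfrac12\alpha^2\beta^2\epsilon^2/((1-\alpha)(1-\beta))\bigr]\, n + O(1).
\]

Now $(1-\epsilon)\log(1-\epsilon)+\epsilon > (1+\epsilon)\log(1+\epsilon)-\epsilon$ and the theorem follows. ∎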

Remark 1. Notice that the above argument does not require ε to be small.

Remark 2. If, however, ε is small, say $\epsilon<\epsilon_0$, then $\varphi(\epsilon) > (1+\epsilon)\log(1+\epsilon)-\epsilon > C\epsilon^2$, where C depends on $\epsilon_0$. If $\alpha\beta n\epsilon^2/\log(1/\epsilon) > N$, where N is a sufficiently large constant, then we can take C so small that the factor $K(1+1/\epsilon)$ is absorbed in the reduced exponent. Then we can state our conclusion as

COROLLARY 3.2.1. $\gamma(n,\alpha,\beta,\epsilon) < \exp(-C\epsilon^2\alpha\beta n)$ for $\epsilon<\epsilon_0$, $\alpha\beta n\epsilon^2/\log(1/\epsilon) > N$, N and C positive constants depending on $\epsilon_0$.

This is the form in which Theorem 3.2.1 will be used most often. In our applications in fact, $\alpha\beta n\epsilon^2/\log(1/\epsilon)$ will tend to ∞ with n.



The key property of our estimate is that it is exponentially small in n. An estimate obtained 
by using the variance and Chebycheff's inequality can only give us a bound for this tail that 
vanishes no faster than an inverse power of n. 



3.3. The Farey Subdivision of the Circle.

The Farey series $F_n$ of order n is the ascending series of irreducible fractions between 0 and 1 whose denominators do not exceed n. For example, $F_5$ is

0/1, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 1/1.

The Farey series possesses many fascinating properties ([Hardy]).

Property 1. If h/k, h'/k' are two successive terms of $F_n$, then kh'-hk' = 1.

Property 2. If h/k, h''/k'', and h'/k' are three successive terms of $F_n$, then h''/k'' = (h+h')/(k+k').

Property 3. If h/k, h'/k' are two successive terms of $F_n$, then k+k'>n.

Property 4. If n>1, then no two successive terms of $F_n$ have the same denominator.

Property 5. The number of terms in the Farey series of order n is asymptotically $3n^2/\pi^2 + O(n\log n)$.
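Properties 1 to 4 are easy to confirm by machine for small orders. The sketch below generates $F_n$ with the standard next-term recurrence (a well-known method, not taken from the thesis) and checks the properties on every pair or triple of successive terms; the function names are illustrative.

```python
from fractions import Fraction


def farey(n):
    """Farey series F_n as a list of Fractions, from 0/1 to 1/1."""
    a, b, c, d = 0, 1, 1, n
    terms = [Fraction(a, b)]
    while c <= n:
        terms.append(Fraction(c, d))
        k = (n + b) // d
        a, b, c, d = c, d, k * c - a, k * d - b
    return terms


def check_properties(n):
    """Assert Properties 1-4 for F_n; returns True if all of them hold."""
    F = farey(n)
    for x, y in zip(F, F[1:]):
        h, k, h2, k2 = x.numerator, x.denominator, y.numerator, y.denominator
        assert k * h2 - h * k2 == 1      # Property 1
        assert k + k2 > n                # Property 3
        assert n == 1 or k != k2         # Property 4
    for x, mid, y in zip(F, F[1:], F[2:]):
        # Property 2: the middle term is the mediant of its neighbours.
        assert mid == Fraction(x.numerator + y.numerator, x.denominator + y.denominator)
    return True
```

Calling `farey(5)` returns the eleven fractions listed above, and `len(farey(n))` can be compared with $3n^2/\pi^2$ to get a feeling for Property 5.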

We will be interested in the circle of the reals (mod 1), denoted by U. The set U forms a group under addition. Consider the mappings

\[
T_i:\ x \to ix \pmod 1, \quad x \in U,
\]

for each i = 2,3,...,k. ($T_1$ is the identity.) It should be clear that the fixpoints (i.e., points x for which $T_j x = x$ for some j) of these mappings are the fractions a/b with $0\le a<b<k$. These are exactly the elements of $F_{k-1} \subseteq U$.

We wish now to partition U into a collection of disjoint intervals (taken to be left closed, right open) J with the property that (1) if $V \in J$, then $T_1V, T_2V, T_3V, \ldots, T_kV$ are all disjoint if V does not contain one of the above fixpoints, and (2) the $V \in J$ that contain a fixpoint can be made arbitrarily small in length.



Let $\varphi_n$ denote (the cardinality of $F_n$) - 1. We now consider the subdivision of the circle defined by the Farey series $F_{2k-2}$. Clearly this contains the fixpoints discussed above ($F_{k-1} \subseteq F_{2k-2}$) and subdivides the circle into $\varphi_{2k-2}$ intervals. We have

LEMMA 1. No two fixpoints (i.e., elements of $F_{k-1}$) are adjacent in $F_{2k-2}$.

Proof. If fixpoints $h_1/k_1$, $h_2/k_2$ are adjacent in $F_{k-1}$, then $k_1+k_2 \le 2k-2$. Hence by Property 3 they cannot be adjacent in $F_{2k-2}$. ∎

For $i = 1,2,\ldots,\varphi_{k-1}$, let us name $L_i$ and $R_i$ respectively the intervals of the above subdivision that lie to the "left" and to the "right" of the i-th fixpoint (in the standard order).

From Lemma 1 it follows that the other endpoint of each $L_i$ and $R_i$ is not a fixpoint. We name the remaining intervals $N_i$, $i = 1,2,\ldots,(\varphi_{2k-2}-2\varphi_{k-1})$.

LEMMA 2. Let X stand for one of the $L_i$ or $R_i$. Then any two of

\[
T_1X,\ T_2X,\ \ldots,\ T_kX
\]

will overlap only if they have an endpoint in common.

Proof. Let the i-th fixpoint be $h_1/k_1$ and to make things concrete suppose we are dealing with $R_i$ (the case of $L_i$ is entirely analogous). Let $h_2/k_2$ be the other endpoint of $R_i$. Then by Lemma 1, $h_2/k_2 \notin F_{k-1}$, so $k_2 \ge k$. Then the length of $R_i$ is

\[
(h_2/k_2) - (h_1/k_1) = 1/(k_1k_2) \le 1/(kk_1).
\]

Thus the length of any of the $T_jR_i$ is $\le 1/k_1$. But all multiples of $[h_1/k_1, h_2/k_2)$ start at a multiple of $1/k_1$. These multiples are spaced at least $1/k_1$ apart, so no two of the multiples of $R_i$ will overlap unless they share a common left endpoint. ∎

LEMMA 3. Let Y denote one of the $N_i$. Then the intervals

\[
T_1Y,\ T_2Y,\ \ldots,\ T_kY
\]

are all disjoint.

Proof. Suppose intervals A and B in the above sequence intersect. By construction A and B do not share any endpoints. Thus if they intersect, we can assume that the left endpoint of B lies within A. Let Y be $[h_1/k_1, h_2/k_2)$. The distance between the left endpoints of A and B cannot be less than $1/k_1$, since the left endpoints are multiples of $1/k_1$. However even the longest multiple has only length $k(h_2/k_2 - h_1/k_1) = k/(k_1k_2) \le 1/k_1$, since $k_2 \ge k$ as no endpoint is a fixpoint. But this contradicts our assumption that the left endpoint of B lies within A. ∎

We now describe how to subdivide the Lj and R,. further, so as to make the intervals with 

an endpoint at a fixpoint as small as we please, while still maintaining the property that all 
other intervals have disjoint multiples. We describe the construction for R^, that for L. 

being analogous. Let us define for each i a subdivision into intervals RSj:, j = 1,2,...,/, and 

RM.. If Rj = [x,y), these subintervals are defined as follows: 



RSjj = [x+({k-1)/ky(y-x), x+((k-1)/k)'"''(y-x)), j = 1,2,...,/ 
RM. = [x, x+((k-1)/k)'(y-x)). 



The following facts are then obvious:

(1) $RM_i \cup \bigcup_{j=1}^{l} RS_{ij} = R_i$,

(2) any two of $RS_{i1}, RS_{i2}, \ldots, RS_{il}, RM_i$ are disjoint,

(3) $RS_{ij}$ has length $((k-1)/k)^{j-1}(y-x)/k$, and $RM_i$ has length $((k-1)/k)^l(y-x)$ (here $y-x$ is the length of $R_i$).
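Facts (1) to (3) can be checked mechanically with exact rational arithmetic. The following is a minimal sketch, not part of the original argument; the function name `subdivide_right` and the values of k and l are illustrative assumptions.

```python
from fractions import Fraction

def subdivide_right(x, y, k, l):
    """Split R = [x, y) into RS_1, ..., RS_l and RM, as half-open (left, right) pairs."""
    r = Fraction(k - 1, k)
    width = y - x
    rs = [(x + r**j * width, x + r**(j - 1) * width) for j in range(1, l + 1)]
    rm = (x, x + r**l * width)
    return rs, rm

k, l = 4, 5                                  # illustrative values only
x, y = Fraction(0), Fraction(1, 2 * k - 2)   # R_1 = [0, 1/(2k-2))
rs, rm = subdivide_right(x, y, k, l)

# Facts (1) and (2): sorted by left endpoint, the pieces tile [x, y) exactly.
pieces = sorted([rm] + rs)
tiles = (pieces[0][0] == x and pieces[-1][1] == y
         and all(a[1] == b[0] for a, b in zip(pieces, pieces[1:])))

# Fact (3): the stated lengths.
r = Fraction(k - 1, k)
lengths_ok = (all(b - a == r**(j - 1) * (y - x) / k
                  for j, (a, b) in enumerate(rs, start=1))
              and rm[1] - rm[0] == r**l * (y - x))
print(tiles, lengths_ok)
```

The pieces shrink geometrically toward the fixpoint x, which is why only RM, the piece touching x, needs separate treatment.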



LEMMA 4. Let Y denote any of the $RS_{ij}$. Then the intervals

$T_1Y, T_2Y, \ldots, T_kY$

are disjoint.

Proof. We first prove the lemma for i = 1, i.e., for a subdivision of the interval $R_1$ whose left endpoint is 0 (= 1). As $1/k \in F_{2k-2}$, $R_1 \subset [0,1/k)$, so we don't have to worry about wrap-around problems for any of the $RS_{1j}$. To complete our argument we only need show that the right endpoint of the (t-1)-st multiple does not exceed the left endpoint of the t-th multiple. This is tantamount to

$(t-1)((k-1)/k)^{j-1} \le t((k-1)/k)^j$

or

$(t-1)/t \le (k-1)/k$,

which is certainly true, as t only takes the values 1,2,...,k. This subdivision is nicely illustrated by Figure 3.3.1.

Now to handle the case of i>1 we need only recall from Lemma 2 that two multiples of $R_i$ overlap only if they have a common left endpoint. But then the situation at each such endpoint is a subcase of the situation described above for i=1 around 0. So by the same argument the multiples of $RS_{ij}$ are disjoint. ∎
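The case i = 1 of Lemma 4 can be illustrated numerically. The sketch below takes $R_1 = [0, 1/(2k-2))$, which assumes the partition points are the elements of $F_{2k-2}$; the helper names are hypothetical.

```python
from fractions import Fraction

def multiples(interval, k):
    """First k multiples T_t Y = [t*a, t*b) of a half-open interval Y = [a, b)."""
    a, b = interval
    return [(t * a, t * b) for t in range(1, k + 1)]

def pairwise_disjoint(intervals):
    """True if half-open intervals (no wrap-around) are pairwise disjoint."""
    ordered = sorted(intervals)
    return all(p[1] <= q[0] for p, q in zip(ordered, ordered[1:]))

k, l = 5, 8                               # illustrative values only
r = Fraction(k - 1, k)
width = Fraction(1, 2 * k - 2)            # R_1 = [0, width), inside [0, 1/k)
rs_1 = [(r**j * width, r**(j - 1) * width) for j in range(1, l + 1)]

# Every RS_{1j} has disjoint first k multiples ...
all_disjoint = all(pairwise_disjoint(multiples(Y, k)) for Y in rs_1)
# ... while the multiples of R_1 itself all start at the fixpoint 0 and overlap.
r1_overlaps = not pairwise_disjoint(multiples((Fraction(0), width), k))
print(all_disjoint, r1_overlaps)
```

Since $k \cdot 1/(2k-2) \le 1$ for $k \ge 2$, none of the multiples wraps around, so plain interval comparison suffices here.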

We can of course repeat the whole construction and the proof of Lemma 4 for the $L_i$. Thus we obtain intervals $LS_{ij}$, $j = 1,2,\ldots,l$, and $LM_i$ that also satisfy (1), (2), and (3) above.

Figure 3.3.1. THE SUBDIVISION INTO INTERVALS OF NON-OVERLAPPING

Before we recapitulate what we have derived in this section we need to make a comment about the lengths of the intervals. Each $R_i$ or $L_i$, being an interval $[h_1/k_1, h_2/k_2)$ between two elements of $F_{2k-2}$, has length $1/(k_1k_2) > 1/(4k^2)$. On the other hand either $k_1$ or $k_2$ is $\ge k$, so the length is at most $1/k$.
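The two length bounds are easy to confirm for small k by listing the Farey series explicitly. This is a sketch using the standard next-term recurrence for Farey series; the names `farey` and `check_lengths` are illustrative.

```python
from fractions import Fraction

def farey(n):
    """Farey series F_n on [0, 1]: reduced fractions with denominator <= n, ascending."""
    a, b, c, d = 0, 1, 1, n
    seq = [Fraction(a, b)]
    while c <= n:
        q = (n + b) // d
        a, b, c, d = c, d, q * c - a, q * d - b
        seq.append(Fraction(a, b))
    return seq

def check_lengths(k):
    """Check 1/(4k^2) < 1/(k1*k2) <= 1/k for all neighbours in F_{2k-2}; return the interval count."""
    seq = farey(2 * k - 2)
    for p, q in zip(seq, seq[1:]):
        gap = q - p
        assert gap == Fraction(1, p.denominator * q.denominator)
        assert Fraction(1, 4 * k * k) < gap <= Fraction(1, k)
    return len(seq) - 1

for k in range(2, 11):
    check_lengths(k)
```

The check covers every pair of neighbours in $F_{2k-2}$, not only those adjacent to a fixpoint, since the same two facts (both denominators at most 2k-2, and their sum exceeding 2k-2) hold for all of them.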



Combining all our constructions we have the following theorem. 

THEOREM 3.3.1. We can construct a partition of the reals mod 1 into disjoint subintervals

$N_i$, $i = 1,2,\ldots,\varphi_{2k-2}-2\varphi_{k-1}$,

$LS_{ij}$, $LM_i$, $RS_{ij}$, $RM_i$, $i = 1,2,\ldots,\varphi_{k-1}$, $j = 1,2,\ldots,l$,

so that

1. each of the $N_i$, $LS_{ij}$, $RS_{ij}$ has (a) disjoint first k multiples and (b) length at least $((k-1)/k)^{l-1}/4k^3$, and

2. each of the $LM_i$, $RM_i$ has (a) an endpoint (the right or left one, respectively) which is a fixpoint and (b) length at most $((k-1)/k)^l/k$.



3.4. The Estimation of the Arithmetic Progressions, and the Prevalence of Randomness. 

We first map intervals on U to intervals on $Z_m$. Corresponding to an interval $[x,y) \subset U$ we have the set of all $i \in Z_m$ with the property that $x \le i/m < y$. (This should be interpreted cyclically; that is, if $x>y$, then we mean $x \le i/m$ or $i/m < y$.) We will now use the names of the intervals introduced in Section 3.3 to denote also the corresponding intervals in $Z_m$.

For an interval $T = [x,y)$ we will denote by t its length in U ($t = y-x$) and by $|T|$ the number of integers it contains in $Z_m$. Clearly we have $|T|$ = (number of i such that $xm \le i < ym$) = $\lceil ym\rceil - \lceil xm\rceil$. Thus

$|T| = tm \pm 2$.

We will write this as

(i) $|T| = tm(1\pm\epsilon)$,

where $\epsilon$ will be a quantity used below. This is justified as long as $\epsilon$ is not too small. As we will see below, the smallest $\epsilon$ we will use will be $\Omega(1/(\log m)^r)$ for some positive r, while the smallest t will be such that $tm > m^\lambda$ for some positive $\lambda$. So as $m \to \infty$, equation (i) is justified; its form will simplify some of the computation below.
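The counting formula can be compared against a brute-force count, including the cyclic case x > y. The following is a small sketch under the assumption $0 \le x, y < 1$; the function names are illustrative.

```python
import math
from fractions import Fraction

def count_in_interval(x, y, m):
    """|T| for T = [x, y): the number of i in Z_m with x <= i/m < y (cyclic if x > y)."""
    hi, lo = math.ceil(y * m), math.ceil(x * m)
    return hi - lo if x <= y else m - (lo - hi)

def brute_force(x, y, m):
    """The same count obtained by testing every residue i/m directly."""
    inside = (lambda u: x <= u < y) if x <= y else (lambda u: u >= x or u < y)
    return sum(1 for i in range(m) if inside(Fraction(i, m)))

cases = [(Fraction(1, 7), Fraction(3, 5), 101),   # ordinary interval
         (Fraction(5, 6), Fraction(1, 9), 97)]    # wraps around 1
for x, y, m in cases:
    t = (y - x) % 1                               # length of T in U
    size = count_in_interval(x, y, m)
    assert size == brute_force(x, y, m)
    assert abs(size - t * m) <= 2
```

Exact fractions are used so that the ceilings are not affected by floating-point rounding.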



Recall that we are selecting a random subset H of $Z_m$ of cardinality $\eta m$. For any interval (in fact any subset) T of $Z_m$, we can ask for the number of elements of H that will fall in T. From Theorem 3.2.1 we know that the number of these elements will be $|T|\eta(1\pm\epsilon)$, except with probability $Y(m,|T|/m,\eta,\epsilon)$, i.e., it will be $\eta tm(1\pm\epsilon)^2$ except with probability $Y(m,t(1\pm\epsilon),\eta,\epsilon)$.

Consider now the collection D of intervals composed of all the $N_i$, $LS_{ij}$, $RS_{ij}$, and the collection C composed of these same intervals and their first k multiples. In this latter case we are dealing with a total of $k(2\varphi_{k-1}l+\varphi_{2k-2}-2\varphi_{k-1})$ intervals. For each interval T in the collection C we assume that H will intersect it in $\eta tm(1\pm\epsilon)^2$ elements, as in the discussion above. This will always be the case except for a fraction of choices of H that is bounded by

$Q = \sum_{T\in C} Y(m,t(1\pm\epsilon),\eta,\epsilon)$.

(This argument does not need any independence assumptions concerning the various choices of T.) Thus with probability $1-Q$, our choice of H will intersect each interval in the collection about as often as we expect.

We now restrict T to be one of the elements of D. In the sequel $\epsilon$ will denote a small quantity that will define all our relative errors. We allow $\epsilon \to 0$ as $m \to \infty$, and we also allow $\epsilon$ to depend on our choice for T. (We write $\epsilon_T$ when we need to make this dependency explicit.) Let us consider the first k multiples of T, and focus our attention on the last one, $T_kT$. This interval has $tkm(1\pm\epsilon)$ elements, and will almost certainly receive $tk\eta m(1\pm\epsilon)^2$ elements of H. Within $T_kT$ we have a subset $S_k$ of cardinality $tm(1\pm\epsilon)$, consisting of those elements that are k-fold multiples of elements of T. How many elements of this subset will H hit? (Note that these points are the endpoints of arithmetic progressions starting at 0, having their first element in T and their k-th in $T_kT$.) Now within $T_kT$ itself we can invoke Theorem 3.2.1 to show that, except for a fraction of possibilities that does not exceed $Y(tkm(1\pm\epsilon),(1/k)(1\pm\epsilon)^2,\eta(1\pm\epsilon)^3,\epsilon)$, the number of elements of $S_k$ that H will hit will be $\eta tm(1\pm\epsilon)^7$.

Consider now the $\eta tm(1\pm\epsilon)^7$ progressions thus specified. We apply the "pull-back" process illustrated in Section 3.1 and in the following Figure 3.4.1. What about the (k-1)-st points of these progressions: how many of these points will be hit by H? By construction, all these points form a subset $S_{k-1}$ of $T_{k-1}T$, an interval disjoint from $T_kT$. By Theorem 3.2.1, confined now to the interval $T_{k-1}T$, we see that the intersection of $S_{k-1}$ and H will be $\eta^2t(1\pm\epsilon)^{13}m$ points, except with probability $Y(t(k-1)m(1\pm\epsilon),(\eta/(k-1))(1\pm\epsilon)^8,\eta(1\pm\epsilon)^3,\epsilon)$. (To amplify, we have here an interval of $t(k-1)m(1\pm\epsilon)$ points; the set $S_{k-1}$ is of size $(\eta/(k-1))(1\pm\epsilon)^8$ times the size of $T_{k-1}T$; and $\eta(1\pm\epsilon)^3$ is the probability that a point in $T_{k-1}T$ will be hit by H. The basic rule we are using throughout is that if $x = X(1\pm\epsilon)^i$, $y = Y(1\pm\epsilon)^j$, then $xy = XY(1\pm\epsilon)^{i+j}$, $x/y = (X/Y)(1\pm\epsilon)^{i+j}$. To make these rules precise it is best to define $x = X(1\pm\epsilon)$ to mean $x\in(X(1-\epsilon), X/(1-\epsilon))$. Note that this redefinition of $1\pm\epsilon$ leaves Theorem 3.2.1 valid.)
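The exponents of $(1\pm\epsilon)$ follow a simple recurrence under the product and quotient rule just stated. The sketch below tracks only the exponents; the function name and the assumption that each application of Theorem 3.2.1 contributes exactly one extra factor $(1\pm\epsilon)$ are illustrative readings of the argument.

```python
def pullback_exponents(k):
    """Exponent of (1 +/- eps) in the candidate count after each of the k pull-back steps."""
    count_exp = 1            # |S_k| = t m (1 +/- eps)^1
    history = []
    for _ in range(k):
        interval_exp = 1                         # |T_j T| = j t m (1 +/- eps)
        ratio_exp = count_exp + interval_exp     # relative size |S_j| / |T_j T|
        density_exp = 3                          # density of H inside T_j T
        theorem_exp = 1                          # one further factor from Theorem 3.2.1
        count_exp = interval_exp + ratio_exp + density_exp + theorem_exp
        history.append((ratio_exp, count_exp))
    return history

print(pullback_exponents(3))   # pairs (exponent of the size ratio, exponent of the new count)
```

Each step adds 6 to the exponent of the surviving count, which starts at 7 after the first step.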



We now have a set of progressions whose last two elements are guaranteed to be hit by H. At the next step we consider the (k-2)-nd points of these $\eta^2tm(1\pm\epsilon)^{13}$ progressions, which define a subset $S_{k-2}$ of $T_{k-2}T$. By analogous computation we obtain that $\eta^3tm(1\pm\epsilon)^{19}$ of these points will belong to our random set H, except with probability $Y((k-2)tm(1\pm\epsilon),(\eta^2/(k-2))(1\pm\epsilon)^{14},\eta(1\pm\epsilon)^3,\epsilon)$. We now continue in this fashion with the (k-3)-rd, ..., 1-st points of the arithmetic progressions. Figure 3.4.1 depicts this pull-back process in which we successively commit H in the intervals $T_iT$, $i = k,k-1,\ldots,1$. At the last step of this process we are considering T itself. After that step the number of candidate arithmetic progressions left will be $\eta^ktm(1\pm\epsilon)^{6k+1}$. These are now confirmed to be entirely in H. The fraction of choices for H that we have eliminated in this process is bounded by the sum

$\sum_{i=0}^{k-1} Y((k-i)tm(1\pm\epsilon),\ (\eta^i/(k-i))(1\pm\epsilon)^{6i+2},\ \eta(1\pm\epsilon)^3,\ \epsilon)$

of all the excluding probabilities.



We now conceive of this process of selection of candidate arithmetic progressions as being carried out for T referring in turn to each of the intervals in D. The total fraction of choices for H thus excluded is bounded by

$W = \sum_{T\in D}\sum_{i=0}^{k-1} Y((k-i)tm(1\pm\epsilon_T),\ (\eta^i/(k-i))(1\pm\epsilon_T)^{6i+2},\ \eta(1\pm\epsilon_T)^3,\ \epsilon_T)$.

Figure 3.4.1. THE PULL-BACK PROCESS

What has all this accomplished? After excluding the choices for H accounted for in Q and W, we can be sure that the number of arithmetic progressions of length k coming from H to 0, and whose first point is in T, is $\eta^ktm(1\pm\epsilon_T)^{6k+1}$, where T is any of the above intervals. Thus the total number of arithmetic progressions of length k coming to 0 from H is

$\sum_{T\in D}\eta^ktm(1\pm\epsilon_T)^{6k+1} + E$,

where E is a correction coming from the fact that we cannot apply our argument to the fixpoint intervals $LM_i$ and $RM_i$. But each of these special intervals cannot contribute more arithmetic progressions than its length. Thus the error committed is bounded by

$0 \le |E| \le 2\varphi_{k-1}[(1/k)((k-1)/k)^lm+2]$.

(See Theorem 3.3.1 and recall that there are $2\varphi_{k-1}$ such intervals.)



Now let $\delta$ be a given small positive constant. We choose l to be the smallest positive integer such that

(ii) $2\varphi_{k-1}(1/k)((k-1)/k)^lm < \eta^km\delta/4$.

Thus $l = \lceil(k\log(1/\eta)+\log\varphi_{k-1}-3\log2+\log(1/\delta))/(\log((k-1)/k))\rceil$. By choosing $\delta$ small enough, and using the fact that $|\log(1-1/k)| > C/k$ for k>1 and some positive constant C, we see that l will always satisfy

(iii) $l > 2k$.

Since $\eta^km > m^\mu$ it is also clear that for m sufficiently large

$(1/k)((k-1)/k)^lm > 2$,

and so

(iv) $|E| \le 2\varphi_{k-1}[(1/k)((k-1)/k)^lm+2] < \eta^km\delta/2$.

The total number of intervals in our partition is then

$F = 2\varphi_{k-1}(l+1)+\varphi_{2k-2}-2\varphi_{k-1} = \varphi_{2k-2}+2l\varphi_{k-1}$.

If t denotes the length of an interval in our collection D, then from Theorem 3.3.1 we have

$t \ge ((k-1)/k)^{l-1}/4k^3 > \eta^k\delta/(32\varphi_{k-1}k(k-1)) > S\eta^k\delta/k^4$ for some positive constant S.

We can therefore write

(v) $S\eta^k\delta/k^4 < t < 1/k$.

(See Properties 3 and 5 of Section 3.3.)

For an interval $T\in D$ of length t, let $\epsilon_T$ be defined by

$1/(1-\epsilon_T) = (1+\delta/(2tF))^{1/(6k+1)}$,

so we assign larger relative errors to smaller intervals.

Now we are ready to total the number of arithmetic progressions of length k coming from H to 0. This number cannot exceed

$\eta^km(\delta/2) + \sum_{T\in D}\eta^ktm(1+\delta/(2Ft)) < \eta^km(\delta/2) + \eta^km + \eta^km(\delta/2) \le \eta^km(1+\delta)$.

In order to obtain a lower bound we use the elementary inequality

$(1+x)^{-y} > (1-x)^y$ for $1 > x, y > 0$.

Then we have

$1-\epsilon_T = (1+\delta/2tF)^{-1/(6k+1)} > (1-\delta/2tF)^{1/(6k+1)}$.



We must avoid, however, situations where $\delta/2tF$ comes too close to 1. We stipulate therefore that we will not attempt to maintain a lower bound during the pull-back process for an interval T, unless $tF > \delta$. Thus we will ignore lower bounds for intervals T much smaller than the average (the average length of an interval is 1/F). As our intervals form sequences with lengths in geometric progressions, we expect that the total length of the uncontrolled intervals (we include in this the $LM_i$ and $RM_i$) will be small. First of all it is easy to see that if $\delta$ is small enough then none of the intervals $N_i$ will violate the condition $tF>\delta$. In each sequence the total length of the intervals violating our condition is certainly less than

$(\delta/F)\sum_{i=0}^{\infty}((k-1)/k)^i = \delta k/F$,

for a total contribution not exceeding

$\delta\,2\varphi_{k-1}k/F$.

But we have $F > 2l\varphi_{k-1} > 4k\varphi_{k-1}$ by (iii), and so the total length of the uncontrolled intervals does not exceed $\delta/2$. Therefore the total of progressions we are counting is at least

$\sum_{T\in D}\eta^kmt(1-(\delta/2tF)) \ge (1-(\delta/2))\eta^km - \eta^km(\delta/2) = \eta^km(1-\delta)$.

In summary, the total of our progressions is

$\eta^km(1\pm\delta)$,

as we had hoped to prove. (Note again $1+\delta<1/(1-\delta)$.)

This, of course, is a useful result only if we can show that the sum Q+W of the excluding probabilities is small. To prove this we will need to restrict our $\eta$ and k. We will assume that

$\eta^km > m^\mu$, $k = O(\log m)$,

where $\mu$ is any small positive constant. We now show that each term in Q or W is exponentially small in m. We will determine an upper bound for the largest term, which certainly occurs in W. A candidate term has the form

/ / \/4_i_ v{6i+6)/(6k+1) i+1. » 

exp(-<p{e-p)(1±€-r)' 'ri tm). 

We first treat the 1 - Cy case. For any interval T we are attempting to control we have 

1-€t > (1-(5/2tF))^'<^'*'^ > {^/2)''^^'''■'\ 
Thus the absolute value of the above exponent is at least 

For the case 1+c-,- we have 

> (5/2F)'^'*^'^<^''*^' (6(k-i)/(6k+6) i*1 ^ 

> (5/2F) t«(l^-')^<6K-6) ^i.i ^_ 

k 4 

as certainly we can take 6<1. Let now t have its smallest value Sij 5/k (from (v)) and 
obtain a lower bound of 



Thus the largest term does not exceed 



exp(-<p(£-p)(S5^/2Fk'^)T|*'*^m). 



Finally for $\varphi(\epsilon_T)$ we have

$\varphi(\epsilon_T) \ge (1+\epsilon_T)\log(1+\epsilon_T)-\epsilon_T$

by Theorem 3.2.1; note also that $(1+x)\log(1+x)-x$ is an increasing function of x. Now

$\epsilon_T = 1-(1+(\delta/2tF))^{-1/(6k+1)} > 1-(1+(\delta k/2F))^{-1/(6k+1)}$.

Since $1-(1+x)^{-1/p} > c(x/p)$ if 0<x<p<1, where p is an integer and c a constant depending on p, we can conclude

$\epsilon_T > c_1\delta/F$,

for some positive constant $c_1$, if $\delta$ is small, say $\delta<1/2$. Therefore

for some other positive constant C2. Combining all of the above we see that there exists a 
positive constant C3 such that no term of Q or W exceeds 

expC-CgS 7} * m/k F ). 

From the definition of l we see that

$l = O(k^2\log(1/\eta)+k\log k+k\log(1/\delta))$, $F = O(k^2l)$,

where the implied constants are absolute. We can finally find an absolute positive constant C that incorporates also the effect of these O-constants and that of adding all the terms in Q and W. For that constant C we can then conclude that

Q+W < exp(-C6%''''^m/[k^^klog1/ij+logk+log1/5)^]). 

We are now basically done. It only remains to check that the assumptions we used were justified. It is very easy to check that assumption (i) is satisfied for the values of $\epsilon_T$ we have chosen. For our repeated applications of the set intersection theorem we need to know that

$t/(1-\epsilon_T) < 1$ and $\eta/(1-\epsilon_T)^4 < 1$.

Now

$t/(1-\epsilon_T) = t(1+(\delta/2tF))^{1/(6k+1)} < t(1+\delta/(2(6k+1)Ft)) = t+\delta/(2(6k+1)F) < 1$, since $t<1/2$, $\delta<1$.

Now note that when we choose $\epsilon_T$ we can assume that $\eta^k/(1-\epsilon_T)^{6k+1} < 1$, for certainly we cannot have more progressions than the length of T. Thus to check $\eta/(1-\epsilon_T)^4 < 1$ we look at $\eta^{6k+1}/(1-\epsilon_T)^{4(6k+1)}$. Now

$\eta^{6k+1}/(1-\epsilon_T)^{4(6k+1)} = (\eta^k/(1-\epsilon_T)^{6k+1})^4\,\eta^{2k+1} < \eta^{2k+1} < 1$.

Thus $\eta/(1-\epsilon_T)^4 < 1$ is also proved. This completes the argument for the following result:



THEOREM 3.4.1. If $\mu$, $\delta_0$ are positive constants while $\eta$, $\delta$, and k can vary with m so that

$0 < \eta < 1$,
$1 < k = O(\log m)$,
$\eta^km > m^\mu$,
$0 < \delta < \delta_0$,

then there exists a constant C depending at most on $\delta_0$ such that as $m\to\infty$, except with probability not exceeding

exp(-C5%''"'^m/[k^^(klog1/i)+logk+log1/8)^]), 
a selection H of $\eta m$ points in $Z_m$ will have

$\eta^km(1\pm\delta)$

arithmetic progressions of length k coming from H to 0.



THEOREM 3.4.2. If t is a type of length k and l hits, then Theorem 3.4.1 applies to the enumeration of progressions of type t coming to 0, if throughout we replace $\eta^k$ by $\eta^l(1-\eta)^{k-l}$. That is, under the assumption that

$\eta^l(1-\eta)^{k-l}m > m^\mu$,

we can conclude that the number of arithmetic progressions of type t coming from H to 0 will be

$\eta^l(1-\eta)^{k-l}m(1\pm\delta)$,
except with probability not exceeding 

exp(-C5%*'''\l-7|)''''''%/[k^^{klog(T|""'(1-Ti)''')+logk+log1/6)^]). 

Proof. In the argument above intersect with H or the complement of it according to whether 
the type specifies a hit or a miss. I 



COROLLARY 3.4.1. The conclusion of Theorem 3.4.1 or Theorem 3.4.2 can be made to apply simultaneously for all progressions coming to all points x in $Z_m$ and all types not exceeding a certain length $k_0 = O(\log m)$.

Proof. Simply look at the sum of all the excluding probabilities. The total number of conditions we are imposing is a polynomial in m (e.g., <number of types> x <number of points>). Now use the fact that $P(m)\exp(-C_1m^\mu) < \exp(-C_2m^\mu)$ as $m\to\infty$ if P denotes a polynomial and $C_2<C_1$. ∎

This last corollary illustrates the power of the exponentially small bounds.



COROLLARY 3.4.2. Under the conditions of Theorem 3.4.1, a selection H of $\eta m$ points in $Z_m$ will have no more than

$2\eta^2m$

pairs (x,y), $x,y\in H$, of points in the specified ratio a:b, $1\le a,b\le k$, except with probability not
exceeding 

exp(-c/*^m/[k^^(klog1/r^+logk)^]). 



Proof. Assume the ratio a:b is in lowest terms. Then apply the argument of this section while only considering $T_aT$ and $T_bT$ in the pull-back process for each interval T. ∎



The following generalization of Corollary 3.4.2 is needed in Section 3.8. The arbitrary 
notation used below is chosen to correspond to the context of that section. 

THEOREM 3.4.4. Let A denote a fixed subset of Z^ of cardinality at least m "*" ^. Let tj = 

m \ where 5^, 82 are small positive constants satisfying 82 > 5-,- Then there exists a 

small positive constant 5 such that: if a subset H of -qm elements is randomly chosen in Z^, 

then the number of pairs of points (u,v) in H with v€A, and u, v in the prespecified ratio a:b, 
1<a<b<k=0(log m), is 0(Tj|A|m ), except with probability that does not exceed exp(-m ). 



Proof. Let 8^ be such that 5^ < §3 < 52- Consider the pull-back process for a certain interval 
T of the partition. Let S^^ denote as before the b-th multiples of points of T. Then S^^ is a 
subset of T^T. Let x denote the number of points of A in S^^. We distinguish two cases.
Case 1: X > m "*" ^. We will apply the pull-back argument to only Tj^T and T^T. We start by
a weak bound on the intersection of A and H in S^^. What is the probability that this 
intersection will exceed xm in size, where 5 is positive but less than 8^, 8^-8^, and 52"^3 ^ 
We use our Theorem 3.2.1. We have 

a - x/m, ^--q, 
1+£ = m /t| > m 

Then 

<p(e)aj8m > [(1+c)log{1+c)-e]a^m > [(1/4)logm m /tj - (m /i| - 1)]xi] 

for some positive constant C. Thus the probability under consideration is exponentially small, 
and we may assume that our intersection does not exceed xm . We now pull back this 
intersection to T^T, thereby defining S^. This is a disjoint set from S^, and applying once 

more our intersection theorem with £ this time small, say 6=1/2, we immediately conclude 
that, except with probability not exceeding 

exp(-C'7|xm ) < exp(-m ), 

the number of pairs (u,v) with uGHOTg^T, vGHflAriT^jT, u,v in the ratio a:b, is no greater than 

0(Tjxm"^). 

Case 2: X < m * ^. We now simply don't bother with the T^^T step of the above argument. 
Just pull back the entire AflS^^ to T^T in order to obtain S^. Thus S^^ has size < m ^, and 

since we are interested in maximizing the number of pairs (u,v), we will in fact assume that 

1/4+80 
Sg^ has size m ^. Applying Theorem 3.2.1. to S^ and H we obtain that, except with 

probability not exceeding 

exp(-C'xTi) < exp(-C'm^3"^^) < exp(-m*), 

the total of pairs {u,v) with uGHflTg^T, vGAHT^jT (and a fortiori those for which vCAnHHTgi) 
and u,v in the ratio a:b, is no greater than 

O(x-n) = 0{m^3"*i).

Now we sum the contributions over all T. The contributions from Case 1 are 0(7j(2yX)m ).
Since the mapping z -*■ bz is 1^1, we must have 2-pX < |A|. Thus the total from case 1 is 
0(ij|A|m" ), which is at least Lm ^" ^' , by our assumption about the size of A. By taking L 
sufficiently large (say L = Cklog m, for some constant C) we can make the contribution of the 
fixpoint intervals negligible compared to this, say no more than m ^ \ Also the contributions 
of Case 2 are no more than 0{k (log m)m ^" ^) (just let all T's have a Case 2 contribution), 
and that too is negligible. 

By choosing the 8 in the statement of the theorem slightly smaller than the 8 we have used 
in the proof, the result follows. I 



Clearly the results of the last Lemma and Theorem can also be generalized so that they apply to all points and all ratios and types up to some maximum length $k_0 = O(\log m)$ simultaneously.

In a certain light what we have shown is that the occurrence of one arithmetic progression of 
length k in H influences very little the occurrence of another such progression. These 
progressions are nearly independent in the sense that they give rise to a distribution 
analogous to that of independent Bernoulli trials. This is why the results of this section could 
not have been obtained by variance arguments alone. It is interesting that for k=3 a similar 
result can be proved using the exponential sums technique of analytic number theory (see 
Appendix). Unfortunately that proof does not appear to generalize to k>3. 



3.5. Double Hashing 

A common technique for collision resolution in hashing is known by the name of double 
hashing. We described the technique briefly in Section 1.2. Here we review some of the 
definitions, before we embark on the analysis of this technique in the following sections. 
Our hash table consists of records, each uniquely identified by a key K. Entries in the table are either occupied or empty. The letters h, g will denote hash functions, i.e., mappings from the set of all possible keys to an appropriate subset of the integers {0,1,...,m-1}, where m is the table size. We will ignore the complication of detecting a full table. With these assumptions the double hashing algorithm is (as in [Knuth2]):

DOUBLE HASHING: Assume m is a prime integer. Let h have range {0,1,...,m-1} and g have range {1,2,...,m-1}.

D1. [First hash.] Set i <- h(K). 

D2. [First probe.] If TABLE[i] is empty, go to D6. Otherwise if KEY[i] = K, the algorithm 
terminates successfully. 

D3. [Second hash.] Set c <- g(K).

D4. [Advance to next.] Set i <- i-c; if now i<0, set i <- i+m. 

D5. [Compare.] If TABLE[i] is empty, go to D6. Otherwise if KEY[i] = K, the algorithm 
terminates successfully. Otherwise go back to D4. 

D6. [Insert.] Insert new record at TABLE[i]. 

The primality of m is essential in order to ensure that in step D4 all table positions are 
generated before a repetition occurs. 
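Steps D1 to D6 translate almost line for line into code. The following is a minimal sketch, not the thesis's own implementation: the use of `None` as the empty marker, the probe counter, the full-table guard (which the text explicitly ignores), and the toy hash functions in the usage lines are all assumptions made for illustration.

```python
def double_hash_insert(table, key, h, g):
    """Search for `key`; insert it if absent. Returns (slot, probes, inserted)."""
    m = len(table)                      # assumed prime
    i = h(key)                          # D1. First hash.
    probes = 1
    if table[i] is not None:            # D2. First probe.
        if table[i] == key:
            return i, probes, False
        c = g(key)                      # D3. Second hash, 1 <= c <= m-1.
        while True:
            if probes == m:             # not in the original: full-table guard
                raise RuntimeError("table is full")
            i -= c                      # D4. Advance to next.
            if i < 0:
                i += m
            probes += 1
            if table[i] is None:        # D5. Compare.
                break
            if table[i] == key:
                return i, probes, False
    table[i] = key                      # D6. Insert.
    return i, probes, True

# Usage with hypothetical hash functions on integer keys.
m = 11
table = [None] * m
h = lambda K: K % m
g = lambda K: 1 + K % (m - 1)
results = [double_hash_insert(table, K, h, g) for K in (0, 11, 22, 11)]
print(results)
```

Because m is prime and $1 \le c \le m-1$, the m probes $i, i-c, i-2c, \ldots$ (mod m) are all distinct, so reaching `probes == m` without finding an empty slot means every position has been examined.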



One way to measure the performance of a hashing algorithm is by the average number of probes required to insert a new element into the table. Of course this average depends on how full the table is. If the table contains n elements, then this average is denoted by $C'_n$.

Since we speak of averages we need to define the probability distribution involved. The 
assumption we make, which is both natural from a theoretical viewpoint and quite justified in 
practice, is that h,g both select each of their allowed values with equal probability, 
independently of their values on other keys. 

If we let m get large, with $n = \alpha m$, $\alpha$ a fixed constant < 1, it has been known from simulations that

$C'_n \approx 1/(1-\alpha)$

with agreement to 1 or 2 tenths of a percent even for m ≈ 1000 (see [Bell-Kam], [Brent]). In the following sections we will prove that

$C'_{\alpha m} = 1/(1-\alpha) + o(1)$ as $m \to \infty$,

provided $\alpha<\alpha_0$, where $\alpha_0$ is some absolute constant (whose value lies between 1/4 and 1/3). Thus we see that for $0<\alpha<\alpha_0$ double hashing is asymptotically equivalent to uniform hashing, a technique we described and analyzed in Section 1.6.
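A small experiment in the spirit of the simulations cited above can be sketched as follows. It fills a table by double hashing with independent uniform hash values and then computes, exactly, the average number of probes a further insertion would need. The table size, load factor and random seed are arbitrary illustrative choices, and a single run like this is only a sanity check, not evidence for the asymptotic statement.

```python
import random

def fill_by_double_hashing(m, n, rng):
    """Insert n keys with independent uniform h in [0, m) and g in [1, m); return occupancy flags."""
    occupied = [False] * m
    for _ in range(n):
        i, c = rng.randrange(m), rng.randrange(1, m)
        while occupied[i]:
            i = (i - c) % m
        occupied[i] = True
    return occupied

def mean_insertion_probes(occupied):
    """Exact average, over all m(m-1) pairs (h, g), of the probes needed to reach an empty slot."""
    m = len(occupied)
    total = 0
    for start in range(m):
        for c in range(1, m):
            i, probes = start, 1
            while occupied[i]:
                i = (i - c) % m
                probes += 1
            total += probes
    return total / (m * (m - 1))

m, alpha = 211, 0.3                      # m prime; small, illustrative values
n = round(alpha * m)
estimate = mean_insertion_probes(fill_by_double_hashing(m, n, random.Random(1)))
uniform = 1 / (1 - n / m)
print(round(estimate, 3), round(uniform, 3))
```

Averaging over all pairs (h, g) rather than sampling them removes one source of noise, so the remaining variation comes only from the configuration of occupied cells.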



This equivalence is somewhat surprising, since we would expect double hashing to do substantially worse than uniform hashing. The reason for this is that all probes in the case of uniform hashing are independent, while this is not so for double hashing. In other words, double hashing exhibits clustering; the probability that two keys will follow the same path is $O(1/m^2)$, not "zero" ($O(1/m!)$) as for uniform hashing. The bad configurations for double hashing are sets of occupied positions containing an excessive number of arithmetic progressions. Such sets will tend to grow into sets with even more arithmetic progressions, as a bit of thought will show. So it is by no means true that all sets of n occupied entries are equally likely under double hashing. The sets with an abnormally high number of arithmetic progressions are those that will make $C'_n$ large and are also exactly those most likely to be obtained by double hashing. The effect of our results is to show that the clustering effect is negligible in the limit.

We will use the terms entry, cell, slot, and point interchangeably to denote a position of the table. The word element or the adjective occupied will be used to distinguish the occupied positions. If we consider an arithmetic progression x, x+d, x+2d, ..., x+kd (where we interpret all algebraic operations mod m), then d will be called its distance and k its length. We will speak of it as an arithmetic progression coming to x. If x+d, x+2d, ..., x+kd all lie in some set S, then we will speak of it as an arithmetic progression from S. Given a point x and a set $S\subset\{0,1,\ldots,m-1\}$ of cardinality $\alpha m$, the expected number of arithmetic progressions of length k coming to x from S is approximately $\alpha^km$, when we consider all such sets equally likely. This is so because there are (m-1) choices for the distance d and for each such progression we have a probability of

$C(m-k,\alpha m-k)/C(m,\alpha m) \approx \alpha^k$

of belonging to S. In Section 3.4 we saw that except for a fraction of selections of S which is exponentially small, the number of arithmetic progressions of length k coming from S to x will be in the range $\alpha^km(1\pm\delta)$, for any small positive $\delta$. This result gives us hope to prove what we want, as it shows that sets with an abnormally high number of arithmetic progressions are exceedingly rare.
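The definitions above can be made concrete with a short sketch that counts the progressions coming to a point and evaluates the membership probability exactly. The function names and the numerical values are illustrative assumptions; m is taken prime and k < m so that the k points of a progression are distinct.

```python
from fractions import Fraction
from math import comb

def progressions_coming_to(x, S, k, m):
    """Number of distances d in 1..m-1 with x+d, x+2d, ..., x+kd (mod m) all in S."""
    return sum(all((x + j * d) % m in S for j in range(1, k + 1))
               for d in range(1, m))

def membership_probability(m, n, k):
    """Probability that k fixed distinct points all lie in a uniformly random n-subset of Z_m."""
    return Fraction(comb(m - k, n - k), comb(m, n))

# A tiny hand-checkable case: distances 1, 2 and 4 work for S = {1,2,3,4} in Z_7.
small = progressions_coming_to(0, {1, 2, 3, 4}, 2, 7)

# The exact probability compared with alpha^k for m = 1009, n = 302, k = 3.
m, n, k = 1009, 302, 3
ratio = float(membership_probability(m, n, k) / Fraction(n, m) ** k)
print(small, ratio)
```

Multiplying the membership probability by the m-1 possible distances gives the expected count, which is where the approximation $\alpha^km$ comes from.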
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3.6. The Seed Set and the Final Argument. 

In this section we prove that double hashing is asymptotically equivalent to uniform hashing for $0<\alpha<\alpha_0$ by using the results of the following two sections. Here $\alpha_0$ denotes an absolute constant, $1/4<\alpha_0<1/3$. Let us prestate here Corollary 3.8.1, which is the result we will need:

COROLLARY 3.8.1. Given any $\alpha,\beta$ such that $0<\beta<\alpha<\alpha_0$, there exists a constant $C_\alpha$ and an initial configuration of $\beta m$ occupied positions such that for any small positive constant $\theta$, if we add $(\alpha-\beta)m$ points to the table using the double hashing process then, except with probability less than $\exp(-C_\theta m^{\delta_0})$, we will arrive at a configuration of $\alpha m$ occupied elements such that for each point x of the table and for each length k, $2\le k\le k'_\alpha = C_\alpha\log m$, the number of arithmetic progressions of length k coming to x from the occupied points is $\alpha^km(1\pm\theta_{\alpha,k})$. Here the $\theta_{\alpha,k}$ denote relative errors satisfying

(a) $\sum_{k=2}^{k'_\alpha}\theta_{\alpha,k}\alpha^k < \theta$,

and

(b) $\alpha^{k'_\alpha}(1+\theta_{\alpha,k'_\alpha})m < m^{1/2-\delta'}$,

where $\delta'$ and $\delta_0$ are small positive constants, and $C_\theta$ is a constant depending on $\theta$ only.



What is the average number of comparisons we need in order to find an empty slot in the resulting configuration, using double hashing? (Recall that, as in Chapter 1, we count the final probe into an empty slot as a comparison.) As we make such a search, let $p_l$ denote the conditional probability that we will make at least (l+1) comparisons before we find an empty slot, $0\le l<m$, given that we hit at least one of the occupied positions. Thus we must on the first probe (h(K)) select one position among the set S of occupied positions, and then select a distance (g(K)) that leads to an arithmetic progression of length (at least) l among elements of S. The average number of comparisons for an unsuccessful search will then be

(i) $1+\alpha\sum_{l=0}^{m-1}p_l$.

We first dispose of the arithmetic progressions of length greater than $k'_\alpha$. Let us consider an occupied point $x\in S$ (all such points are equivalent for the computation below). We claim that there are no arithmetic progressions coming from S to x of length exceeding $(m^{1/2-\delta'}+1)k'_\alpha$. This is so, since if we had such a progression, then we would have more than $m^{1/2-\delta'}$ progressions of length $k'_\alpha$ coming to x from S. (If d is the distance of the original progression, then $d, 2d, 3d, \ldots, (m^{1/2-\delta'}+1)d$ would all be distances of progressions of length $k'_\alpha$ that are subsets of the long progression.) But this contradicts condition (b) of the above corollary. Now for lengths between $k'_\alpha$ and $k'_\alpha(m^{1/2-\delta'}+1)$, we can have at most as many arithmetic progressions as we have at $k'_\alpha$. The total contribution of these to (i) is

$\alpha m\cdot\frac{1}{m(m-1)}\cdot\sum_{k=k'_\alpha}^{k'_\alpha(m^{1/2-\delta'}+1)}m^{1/2-\delta'} = O(k'_\alpha m^{-2\delta'}) = o(1)$ as $m\to\infty$,

since $k'_\alpha = O(\log m)$. Here the factor $\alpha m$ counts the choices for x, the factor $1/(m(m-1))$ is the probability of choosing a specific x and a specific distance, and the sum is the contribution of all distances in question.



Here we have ignored the contribution of the excluded configurations, but these can contribute at most a total of

$m\exp(-C_\theta m^{\delta_0}) = o(1)$ as $m\to\infty$

to the mean (m being the maximum number of probes for any search), and so from now on they will be ignored for good. For the shorter k we see that our corollary implies

$p_k = \alpha^k(1\pm\theta_{\alpha,k})$, $2\le k\le k'_\alpha$,

and certainly $p_0 = 1$, $p_1 = \alpha$. So the contribution of short lengths to $\sum p_l$ is

$1+\alpha+\sum_{l=2}^{k'_\alpha}\alpha^l(1\pm\theta_{\alpha,l}) = \sum_{l=0}^{k'_\alpha}\alpha^l\pm\theta = 1/(1-\alpha)\pm\theta+o(1)$,

where we have used conclusion (a) of the corollary and the fact that

$\sum_{l>k'_\alpha}\alpha^l = o(1)$ as $m\to\infty$,

since $k'_\alpha\to\infty$ as $m\to\infty$. Combining all of our conclusions, we have proved that the average number of comparisons needed to find an empty position in a table filled up to load factor $\alpha$ as described in the corollary is

$1+\alpha/(1-\alpha)\pm\alpha\theta+o(1) =$

(ii) $1/(1-\alpha)\pm\alpha\theta+o(1)$.

Unfortunately we are not done, because we did not start from an empty table. The double hashing algorithm was applied only after an initial seed of $\beta m$ points was already strategically placed in the table.

In order to complete our argument, we need to investigate the effect of these initial $\beta m$ points. We have added $(\alpha-\beta)m$ keys using the double hashing process. What if we had added these same $(\alpha-\beta)m$ keys to an initially empty table using double hashing? Let us select a specific hash sequence $(h(K_1),g(K_1)), (h(K_2),g(K_2)), \ldots$ and so on. Let S denote the set obtained by adding points with this sequence to the initial $\beta m$ set, and S' the corresponding set obtained by adding points using the same hash sequence to an initially empty table. Then we claim $S'\subset S$. Consider the first point K in our sequence whose insertion would cause an alleged violation of this condition. Either our key K ends up in the same position in both S' and S, in which case there can be no violation, or our key continues on a longer search path in S than it did in S'. But then the location where K ends up in S' must have already been occupied in S, and so again no violation is possible. The above remark implies that the average number of probes to find an empty slot with configuration S is an upper bound for $C'_{(\alpha-\beta)m}$, i.e.,

$C'_{(\alpha-\beta)m} \le 1/(1-\alpha)+\alpha\theta+o(1)$,

or

$C'_{\alpha m} \le 1/(1-\alpha-\beta)+(\alpha+\beta)\theta+o(1)$,

by a simple change of variable. (Assume $\beta$ is so small that $\alpha+\beta<1$.)
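The containment $S'\subset S$ can be observed directly by replaying one hash sequence on a seeded table and on an empty one. This is a sketch only; the table size, seed positions and random seed are arbitrary, and the seed here is random rather than the strategically chosen configuration of the corollary.

```python
import random

def insert(occupied, h, c):
    """Double hashing insertion of one new key with hash values (h, c); returns the slot used."""
    m = len(occupied)
    i = h
    while occupied[i]:
        i = (i - c) % m
    occupied[i] = True
    return i

def seeded_vs_empty(m, seed_slots, hash_sequence):
    """Apply the same hash sequence to a seeded table (giving S) and to an empty one (giving S')."""
    S = [False] * m
    for s in seed_slots:
        S[s] = True
    S_prime = [False] * m
    for h, c in hash_sequence:
        insert(S, h, c)
        insert(S_prime, h, c)
    return S, S_prime

rng = random.Random(7)
m = 101                                               # prime
seed_slots = rng.sample(range(m), 10)                 # stands in for the initial beta*m points
hash_sequence = [(rng.randrange(m), rng.randrange(1, m)) for _ in range(30)]
S, S_prime = seeded_vs_empty(m, seed_slots, hash_sequence)
contained = all(S[i] for i in range(m) if S_prime[i])
print(contained, sum(S), sum(S_prime))
```

The check succeeds for any seed and any hash sequence that does not fill the table, for the reason given in the text: every slot a key passes over in S' is already occupied in S.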

Next we get a lower bound for $C'_{\alpha m}$. For this paragraph only $O_\beta$ will mean O with reference to $\beta\to0$. We just saw that if we start with $\beta m$ points rather than an empty table, we can only do worse. But how much worse? Again let us fix our attention on the particular hash sequence on hand, $(h(K_1),g(K_1)), (h(K_2),g(K_2)), \ldots$. In the set difference $S-S'$ we have $\beta m$ points. Now suppose we are at the final configuration S and let us look at an arithmetic progression of length k of occupied cells in S coming to x. If this progression contains at least one point in $S-S'$, we shall say that it is destroyed. (This means that it contributed to the computation of $p_k$ for $\alpha$ but will not contribute to the one for $\alpha-\beta$.) No point in $S-S'$ can destroy more than k such progressions, so the total number of progressions of length k coming to x that are destroyed is bounded by $k\beta m$. Of course we can never destroy more arithmetic progressions than there are, which is $\alpha^km(1+\theta_{\alpha,k})$. Now let $k_0 = \log(1/\beta)/\log(1/\alpha)$. Then the number of progressions coming to x of length greater than or equal to $k_0$ that can possibly be destroyed is

$\sum_{k=k_0}^{m-1}\alpha^km(1+\theta_{\alpha,k}) = O(\alpha^{k_0}m) = O(\beta m)$,

where we estimated the sum as we estimated sum (i) (using also the obvious fact that the errors $\theta_{\alpha,k}$ are bounded for fixed k, as $m\to\infty$). From 1 to $k_0$ we can destroy at most

$k_0^2\beta m = O_\beta(\beta\log^2(1/\beta)m)$

arithmetic progressions. Thus the total of destroyed progressions coming to x is

$O_\beta(\beta\log^2(1/\beta)m)$,

and we have shown

$C'_{\alpha m} \ge 1/(1-\alpha-\beta)-(\alpha+\beta)\theta-O_\beta(\beta\log^2(1/\beta))+o(1)$

by arguing as in the previous paragraph.



To summarize, we have 

$$1/(1-\alpha-\beta) - (\alpha+\beta)\theta - O_\beta(\beta \log^2(1/\beta)) + o(1) < C'_{\alpha m} < 1/(1-\alpha-\beta) + (\alpha+\beta)\theta + o(1).$$

Since $\theta$, $\beta$ can be taken to be arbitrarily small, we have proved 

THEOREM 3.6.1. The average number of comparisons needed to find an empty slot with 
double hashing in a table of size m, filled up to load factor $\alpha$, $\alpha<\alpha_0$, is 

$$C'_{\alpha m} = 1/(1-\alpha) + o(1) \quad \text{as } m \to \infty.$$
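
As a sanity check on the statement of the theorem, one can fill a small table by double hashing and compute exactly, over all pairs (h, g), the average number of probes needed to reach an empty cell. This is only an illustrative sketch: the probe order h, h-g, h-2g, ..., the table size 211 and the load factor 0.3 are assumptions chosen for the example, and a single small table only approximates the asymptotic statement.

```python
import random

def fill_table(m, n_keys, rng):
    """Fill n_keys cells of an empty table of prime size m by double hashing."""
    table = [False] * m
    for _ in range(n_keys):
        pos, g = rng.randrange(m), rng.randrange(1, m)
        while table[pos]:
            pos = (pos - g) % m
        table[pos] = True
    return table

def average_probes_to_empty(table):
    """Exact average, over all (h, g) pairs, of the probes needed to reach an empty cell."""
    m = len(table)
    total = 0
    for h in range(m):
        for g in range(1, m):
            pos, probes = h, 1
            while table[pos]:
                pos = (pos - g) % m
                probes += 1
            total += probes
    return total / (m * (m - 1))

m, alpha = 211, 0.3                      # illustrative prime size and load factor
table = fill_table(m, int(alpha * m), random.Random(0))
observed = average_probes_to_empty(table)
predicted = 1 / (1 - alpha)
print(round(observed, 3), round(predicted, 3))
```

For a table this small the two numbers should agree only roughly; the theorem concerns the limit of large m.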

3.7. The Lattice Flows and the Extension Process. 

Let x be a point of the table, and let $\tau$ be a type of length k with h hits and k-h misses. If $\gamma m$ 
points of the table are occupied, $0<\gamma<1$, then the expected number of arithmetic 
progressions of type $\tau$ coming to x is $\gamma^h(1-\gamma)^{k-h}m$. In this and the following section we 
show that if we start with a configuration of $\beta m$ occupied points in which every point has 
nearly the expected number of arithmetic progressions of every type and grow this table to 
$\alpha m$ elements using the double hashing process, then if we only exclude an exponentially 
small fraction of possibilities, we can be sure that the resulting configuration of $\alpha m$ points 
will also have nearly the expected number of arithmetic progressions of every type coming 
to every point. 

To illustrate the argument we first discuss how we can prove such a statement if the 
additional $(\alpha-\beta)m$ elements were randomly inserted. We add the new points in groups of $\eta m$ 
at a time, where $\eta$ is very small compared to $\alpha$ or $\beta$. Suppose we currently have $\gamma m$ 
elements in the table and are about to add $\eta m$ new ones. Fix a point x of the table and 
consider the arithmetic progressions of length k coming to x. The various types to which 
these progressions may belong form a boolean lattice, as illustrated by Figure 3.7.1. 



As the new $\eta m$ points are added, there will be flows upwards in this lattice, that is, some 
arithmetic progressions will shift into types with more hits. For example, the first point of a 
progression of type (001) may become occupied by one of the $\eta m$ points, whereas the 
second may stay empty, thus causing the progression to shift into type (101). In order to 
estimate the magnitude of these inter-type flows we need to introduce some notation. 
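
The lattice of types and its upward flows can be enumerated directly. The sketch below is illustrative only: it writes a type as a string of "0" (miss) and "1" (hit), and the helper name `type_lattice` is hypothetical.

```python
from itertools import product

def type_lattice(k):
    """Return all hit/miss types of length k and the single-hit upward flows between them."""
    types = [''.join(bits) for bits in product('01', repeat=k)]
    flows = []
    for t in types:
        for i, c in enumerate(t):
            if c == '0':
                # Turning the miss in position i into a hit moves t one level up.
                flows.append((t, t[:i] + '1' + t[i + 1:]))
    return types, flows

types, flows = type_lattice(3)
print(len(types), len(flows))
print(('001', '101') in flows)
```

For k = 3 this lists the 8 types and the 12 arrows of the boolean lattice, including the flow from (001) to (101) used as the example in the text.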
Figure 3.7.1. THE LATTICE OF TYPES OF ARITHMETIC PROGRESSIONS OF A GIVEN LENGTH 
COMING TO A POINT (k = 3). Arrows indicate inter-type flows; "1" denotes a hit and "0" 
denotes a miss. 

DEFINITION 3.7.1. Let x be a point of the table, $\tau$ a type of length k, i an integer, $1\le i\le k$, and 
$\Gamma$ a configuration of $\gamma m$ occupied positions; then by $S(i,\tau,x,\Gamma)$ we will denote the set of the 
i-th points of the arithmetic progressions of type $\tau$ coming to x. We also introduce the 
density 

$$\sigma(\tau,x,\Gamma) = |S(i,\tau,x,\Gamma)|/m,$$

which is clearly independent of i (since m is prime). 

Throughout the arguments that follow we will be dealing with inequalities on the $\sigma(\tau,x,\Gamma)$. We 
introduce the symbol $\sigma(\tau,\gamma)$ to stand for any of $\sigma(\tau,x,\Gamma)$, where x ranges over all points and $\Gamma$ 
over all non-excluded configurations of $\gamma m$ occupied positions. Thus when we write 

$$\sigma(\tau,\gamma) = \lambda(1\pm\mu),$$

we mean $\lambda(1-\mu) < \sigma(\tau,x,\Gamma) < \lambda/(1-\mu)$ for all x and $\Gamma$ as described above. 
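
The remark that the density does not depend on i can be verified on a toy table. This is a sketch under an assumed convention: for each nonzero difference d, the progression of length k "coming to x" is taken to be x-kd, ..., x-2d, x-d (mod m). The table size, the occupied set and the helper names are illustrative.

```python
def progressions_by_type(m, occupied, x, k):
    """Group the length-k progressions coming to x by their hit/miss type.

    Assumed convention: for each d in 1..m-1 the progression is
    x - k*d, x - (k-1)*d, ..., x - d (mod m), so its next term would be x.
    """
    by_type = {}
    for d in range(1, m):
        points = [(x - (k - j) * d) % m for j in range(k)]
        t = ''.join('1' if p in occupied else '0' for p in points)
        by_type.setdefault(t, []).append(points)
    return by_type

def S(i, t, by_type):
    """Set of i-th points (1-based) of the progressions of type t."""
    return {points[i - 1] for points in by_type.get(t, [])}

m, k, x = 13, 3, 0                       # m must be prime for the sizes to agree
occupied = {1, 2, 5, 7, 11}
by_type = progressions_by_type(m, occupied, x, k)
sizes = {t: [len(S(i, t, by_type)) for i in range(1, k + 1)] for t in by_type}
print(sizes)
```

Because m is prime, distinct differences d give distinct i-th points, so each set has exactly as many elements as there are progressions of that type.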



We now present a heuristic argument for the case of random insertions. Assume that our 
configuration $\Gamma$ is such that $\sigma(\tau,\gamma) = \gamma^h(1-\gamma)^{k-h}$ for all types $\tau$, where h denotes the number 
of hits and k-h the number of misses of the type. Thus $|S(i,\tau,x,\Gamma)| = \gamma^h(1-\gamma)^{k-h}m$. What 
happens to the arithmetic progressions of type $\tau$ coming to x as the new $\eta m$ points are 
added? Consider a type $\tau'$ which is $\tau$ except a hit of $\tau$ in the i-th position is a miss in $\tau'$, 
e.g., $\tau$ = (101), $\tau'$ = (001) in the example above. Then for each point of $S(i,\tau',x,\Gamma)$ that is hit 
by the $\eta m$, a progression may change from type $\tau'$ to type $\tau$. This is illustrated in Figure 
3.7.2. 



There are $(1-\gamma)m$ unoccupied elements, of which we are choosing $\eta m$; thus the probability of 
selecting a given point is $\eta/(1-\gamma)$. By hypothesis the size of $S(i,\tau',x,\Gamma)$ is $\gamma^{h-1}(1-\gamma)^{k-h+1}m$, and 
therefore the expected size of its intersection with the $\eta m$ is $\eta\gamma^{h-1}(1-\gamma)^{k-h}m$. Since $\tau$ has h 
hits, there are h possible positions at which such inflows into type $\tau$ can occur (i.e., there are 
h possible feeder types $\tau'$), for a total of $h\eta\gamma^{h-1}(1-\gamma)^{k-h}m$. Now some of the progressions of 
type $\tau$ can move out of this type. For each i that corresponds to a miss of $\tau$, this will 
happen whenever the set $S(i,\tau,x,\Gamma)$ is intersected by the $\eta m$. We easily compute the size of 
the outflow to be $(k-h)\eta\gamma^{h}(1-\gamma)^{k-h-1}m$. In the above we have ignored the possibility that a 
transition between two types can occur with more than one of the points of a progression 
being hit by the $\eta m$. Any flows arising out of such transitions, however, will have an 
expected magnitude of $O(k^2\eta^2 m)$, and since we take $\eta$ to be very small, they can be ignored. 

Figure 3.7.2. THE MECHANISM OF INTER-TYPE FLOWS 

To total up, when the new $\eta m$ points have been added, the expected number of arithmetic 
progressions of type $\tau$ coming to x will be 

$$\gamma^h(1-\gamma)^{k-h}m + h\eta\gamma^{h-1}(1-\gamma)^{k-h}m - (k-h)\eta\gamma^{h}(1-\gamma)^{k-h-1}m,$$

which is $(\gamma+\eta)^h(1-\gamma-\eta)^{k-h}m$ if again we ignore $O(k^2\eta^2 m)$ terms. 

Thus $|S(i,\tau,x,\Gamma)| = (\gamma+\eta)^h(1-\gamma-\eta)^{k-h}m$ on the average, as we had hoped. By iterating this 
heuristic argument we see how we can grow from $\beta m$ to $\alpha m$ elements while having the 
expected number of arithmetic progressions of any type at each point at each step. 
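
The first-order bookkeeping in the heuristic argument can be checked numerically. The sketch below works with densities (the common factor m is dropped) and uses illustrative values of the load factor, the step size, and the type; the function names are hypothetical.

```python
def expected_after_step(gamma, eta, h, k):
    """Density of a type with h hits out of k after one step: present + inflow - outflow."""
    present = gamma**h * (1 - gamma)**(k - h)
    inflow = h * eta * gamma**(h - 1) * (1 - gamma)**(k - h)
    outflow = (k - h) * eta * gamma**h * (1 - gamma)**(k - h - 1)
    return present + inflow - outflow

def target_density(gamma, eta, h, k):
    """Density one would expect at load factor gamma + eta."""
    return (gamma + eta)**h * (1 - gamma - eta)**(k - h)

gamma, eta, h, k = 0.3, 0.001, 2, 3       # illustrative values
gap = abs(expected_after_step(gamma, eta, h, k) - target_density(gamma, eta, h, k))
print(gap, k**2 * eta**2)
```

The gap between the two expressions is second order in the step size, which is why terms of order k^2 eta^2 can be neglected in the heuristic.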



In the next section we will show how to carry out this argument rigorously. For the 
remainder of the current section we confine ourselves to some definitions and general 
remarks. We shall use the term the extension process for this process of building up the 
table we are describing. This process consists of steps of adding $\eta m$ points at a time. 
During each step, given any two types of progressions coming to a point x, there may be 
transitions of actual progressions from one type to the other. These inter-type transitions 
will be called flows. For each type we will have a certain inflow and outflow of progressions 
from it. Naturally we cannot assume that a type will have exactly the expected number of 
progressions, as we have done in the heuristic argument above. We introduce relative 
errors $\theta$ on this expected value that describe the deviation we are willing to allow. In 
other words, when we are at load factor $\gamma$, we assume that for each point x and each type $\tau$ 
of length k (k not exceeding a certain maximum) and h hits, we will have 

$$\gamma^h(1-\gamma)^{k-h}m(1\pm\theta_{\gamma,k})$$

arithmetic progressions of type $\tau$ coming to x. Here we have already adopted the 
convention that we will follow in the actual argument and suppressed the dependency of 
$\theta_{\gamma,k}$ on anything but the length k of $\tau$. We will find that the errors $\theta_{\gamma,k}$ grow faster for larger k, 
but if we compute the total number of arithmetic progressions coming to x of types 
consisting entirely of hits, then the relative error on this total we will be able to make as 
small as we please. Now double hashing chooses each of the empty points with probability 
proportional to the number of arithmetic progressions coming from the occupied points to 
that point. The above remark then implies that during the current step every empty position 
is nearly equally likely to be filled. So we are not too far from the random situation. But 
how can we be sure that we will maintain the same good situation during the next step? 
Here we invoke Theorem 3.2.1 to assure that all intersections between the $\eta m$ points and the 
various sets $S(i,\tau,x,\Gamma)$ of Definition 3.7.1 are nearly the same size. In doing this we exclude 
an exponentially small fraction of choices of the $\eta m$ points, while increasing the relative 
errors $\theta_{\gamma,k}$ to $\theta_{\gamma+\eta,k}$ for the next step. We will speak of using Theorem 3.2.1 for controlling 
the intersections, and therefore the flows. In order to keep the error propagation equations 
for $\theta_{\gamma,k}$ relatively clean, we will allow certain additional absolute errors as well (the 
"residual" progressions of the next section). During any step, if there is a number of 
progressions flowing between two types that is allowed by our control but cannot be 
accounted for in the relative errors we allow, this number we will speak of as an excessive 
flow. The gist of the argument then is that by excluding an exponentially small fraction of 
possibilities, we maintain at each step every empty position nearly equally likely to be filled. 
We never give clustering a chance to build up a really bad configuration. 
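
The statement that double hashing fills an empty point with probability proportional to the number of progressions of occupied points coming to it can be verified exhaustively on a small table. This is a sketch under assumptions: the probe order is taken to be h, h-g, h-2g, ... (mod m), progressions of every length k >= 0 are counted (length 0 being the direct hit h = x), and the table below is an arbitrary example.

```python
def landing_counts(m, occupied):
    """For each empty cell, count the (h, g) pairs whose probe path ends there."""
    counts = {x: 0 for x in range(m) if x not in occupied}
    for h in range(m):
        for g in range(1, m):
            pos = h
            while pos in occupied:
                pos = (pos - g) % m
            counts[pos] += 1
    return counts

def progression_counts(m, occupied):
    """For each empty cell x, count fully occupied progressions of every length k >= 0 coming to x."""
    counts = {}
    for x in range(m):
        if x in occupied:
            continue
        total = 0
        for d in range(1, m):
            k = 0
            # Extend the run x+d, x+2d, ... while it stays inside the occupied set.
            while (x + (k + 1) * d) % m in occupied:
                k += 1
            total += k + 1          # lengths 0, 1, ..., k all qualify for this d
        counts[x] = total
    return counts

m, occupied = 23, {1, 2, 4, 8, 16, 9, 18}   # illustrative prime size and occupied set
landing = landing_counts(m, occupied)
via_progressions = progression_counts(m, occupied)
print(landing == via_progressions)
```

Dividing either count by m(m-1) gives the probability that the next insertion fills the given cell.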



We now make a number of remarks that the reader should keep in mind while reading the 
next section. 



Remark 1. The types that ultimately play a role in double hashing are those consisting 
entirely of hits. Because, however, the population of types changes by inter-type flows, we 
have to attempt to control all types at once. 

Remark 2. Suppose we wish to maximize the number of progressions in a type $\tau$ consisting 
of k hits. During each step the significant inflows into $\tau$ are those from types with k-1 hits. 
Obviously we should maximize these inflows. Now these inflows are also outflows from the 
"feeder" types one step below in the lattice. In order to maximize those same inflows during 
the next step, we want to maximize the growth of the feeder types during the current step. 
But these types have their outflows already chosen, so the best we can do is to maximize 
the inflows into them. An inductive extension of this argument shows that all flows in the 
lattice should take their maximum allowed value during every step, if we are interested in 
maximizing the growth at the apex of the lattice. Similarly if we wish to minimize this growth, 
all flows should be minimized. The point made here is important and somewhat subtle, and 
the reader should dwell on it for a moment. Another way to see the point is this. Consider 
one of the sets S corresponding to one of the feeder types. At the current step a fraction $p_1$ 
of S will flow, where $p_1$ is allowed to vary within certain limits. At the next step a fraction $p_2$ 
of the part of S that is left will flow, and so on, say up to $p_q$. Then it is simple to see that the 
total fraction of S that has flowed is 

$$1 - \prod_{i=1}^{q} (1-p_i)$$

and this expression is maximized when all of the $p_i$ are maximized. The intuitive 
interpretation of this is that if we wish to maximize the total flow between two types, we 
should never trade the certainty of a specific transition in the current step for the probability 
of that same transition in some future step. 
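
The monotonicity used in Remark 2 is easy to illustrate: the total fraction that has flowed after several steps increases whenever any single step's fraction increases. The numbers below are arbitrary illustrative values.

```python
from math import prod

def total_flow(fractions):
    """Fraction of a set that has flowed after steps with the given per-step fractions."""
    return 1 - prod(1 - p for p in fractions)

low = total_flow([0.10, 0.20, 0.05])      # each step at the lower end of its allowed range
high = total_flow([0.15, 0.25, 0.10])     # each step at the upper end
print(low, high)
```

Raising every per-step fraction can only shrink the product of the (1 - p) factors, so the total flow grows.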



Remark 3. If we are interested in maximizing the flows, it will only hurt our upper bound to 
make any of the sets S of Definition 3.7.1. that partake in the controlled intersections larger 
than it really is. 

Remark 4. Since we are dealing with non-negative quantities, a relative error smaller than -1 
clearly does not make sense. We do, however, allow such fictitiously large negative errors 
in the argument of the next section, since they can only make our lower bounds worse and 
they avoid consideration of special cases. 

Remark 5. If P(m) denotes any polynomial in m, and C, $\delta_1$, $\delta_2$ are constants with C>0, $\delta_2>\delta_1>0$, then 
for m sufficiently large 

$$P(m)\exp(-Cm^{\delta_2}) < \exp(-Cm^{\delta_1}).$$

Remark 6. Let $\theta$ denote an arbitrarily small positive number and let $\psi(m)$ be a quantity which 
is o(1) as $m \to \infty$. Then we will say that $\psi$ can be incorporated in $\theta$ to mean that, given any 
positive constant $\theta'$, for m sufficiently large we can assume that the sum $\theta+\psi(m)$ does not 
exceed $\theta'$. We use this terminology on a number of occasions. This is justified because it 
will be trivial to check that the sum of the $\psi(m)$ over all instances of the terminology that 
refer to the same $\theta$ is o(1). 



Remark 7. We will make some use of the O, o notations. They always refer to $m \to \infty$, and 
the implied constants are either absolute or depend at most on $\alpha$, which is a constant of the 
entire problem. In Corollary 3.8.1 we also use the notations $\ll$, $\approx$ with their usual heuristic 
meaning. If the reader wishes to have an exact meaning, then he may take, in the context 
where these occur, $f \approx g$ to mean $g m^{-\delta} < f < g m^{\delta}$, and $f \ll g$ to mean $f < g m^{-\delta}$, for some 
small positive $\delta$. 

Remark 8. The reader should realize that the process of inter-type flows we have described 
is only a model for what occurs in the real table. The model will be used to bound the 
number of progressions we can actually have in the table. It need not be the case that the 
flows we use in the estimations of the next section can actually be realized by some 
sequence of insertions into the actual table. 



3.8. The Propagation of Errors and the Impotence of Clustering 

We will now carry out a precise estimation of the error propagation in the extension 
process. We assume $\alpha$, $\beta$ are fixed constants, $\beta$ small, $0<\beta<\alpha<1$. In the course of the 
computation we will find that we have to restrict $\alpha$ to be below some absolute constant $\alpha_0$, 
$\alpha_0 < 1$. We take 

$$\eta = m^{-1/4-\delta_1}$$

and define 

(i) 
$$k_\beta = [(3/4-\delta_2)/\log(1/\beta)]\,\log m \quad (\text{so } \beta^{k_\beta}m = m^{1/4+\delta_2}),$$
$$k_\alpha = [(1/2+\delta_3)/\log(1/\alpha)]\,\log m \quad (\text{so } \alpha^{k_\alpha}m = m^{1/2-\delta_3}),$$

where $\delta_0$, $\delta_1$, $\delta_2$, $\delta_3$, $\delta_4$ are small positive constants such that 

(ii) $\delta_2 > \delta_1$, $\delta_2-\delta_1 > \delta_4 > \delta_0 > 0$ ($\delta_0$, $\delta_4$ will be used later). 

Our choice for $\eta$ is a compromise between two conflicting requirements. On the one hand 
we want to make $\eta$ as large as possible so as to get the maximum benefit from the law of 
large numbers and Theorem 3.2.1. On the other hand we want to take $\eta$ sufficiently small so 
that we can ignore the interactions of the $\eta m$ points among themselves. 
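
To get a feeling for the sizes involved, the parameters can be evaluated for a concrete table. The sketch below is an illustration only. It assumes the reading eta = m^(-1/4 - delta_1) of the step size, leaves the two cutoffs unrounded, and uses arbitrary values for m, alpha, beta and the small constants.

```python
from math import log

def extension_parameters(m, alpha, beta, d1, d2, d3):
    """Step size and length cutoffs of the extension process (cutoffs left unrounded)."""
    eta = m ** (-0.25 - d1)                          # assumed reading of the step size
    k_beta = (0.75 - d2) / log(1 / beta) * log(m)    # chosen so that beta**k_beta * m = m**(1/4 + d2)
    k_alpha = (0.5 + d3) / log(1 / alpha) * log(m)   # chosen so that alpha**k_alpha * m = m**(1/2 - d3)
    return eta, k_beta, k_alpha

m, alpha, beta = 1000003, 0.3, 0.05                  # illustrative values
d1, d2, d3 = 0.01, 0.03, 0.02                        # illustrative small constants with d2 > d1
eta, k_beta, k_alpha = extension_parameters(m, alpha, beta, d1, d2, d3)
print(eta * m, k_beta, k_alpha)
```

Even for a table of a million cells the cutoffs are single-digit lengths, while each step adds on the order of m^(3/4) points.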
During the extension process we need to maintain control over arithmetic progressions of 
length $k_\alpha$, since the argument of Section 3.6 depends heavily on our ability to push the 
number of arithmetic progressions of length $k_\alpha$ below some power less than $m^{1/2}$. 
Unfortunately in the early stages of the extension process we are then out of luck. For 
types $\tau$ of length $k_\alpha$ and many hits, the expected number of progressions of that type coming 
to a point will be too small to either assert anything initially, or to control the inter-type flows 
by bounding the size of the intersections with the $\eta m$ points. To circumvent this 
shortcoming we introduce a technical device. For each point x and for each type $\tau$ of length 
between $k_\beta$ and $k_\alpha$ we introduce an initial maximum positive "error" of size $E_0 = m^{1/4+\delta_2}$ in 
the number of arithmetic progressions of type $\tau$ coming to x. This error is in addition to the 
regular relative errors discussed in Section 3.7. As we will see, it provides us with a way of 
masking out the fact that we cannot control the size of the relative errors during the early 
stages of the extension process. These additional progressions will of course flow among 
the types like the normal ones we have already considered. We will call them the residual 
progressions and will control their flows independently of the regular progressions. 



We will ignore the outflow of residual progressions from any given type. Thus their number 
can only grow and will never become less than $E_0$. By analogy with Definition 3.7.1 we 
introduce the notations $R(i,\tau,x,\Gamma)$, $\rho(\tau,x,\Gamma)$ to denote the corresponding quantities for the 
residual progressions. Thus $\rho(\tau,x,\Gamma) \ge m^{-3/4+\delta_2}$. If at any moment during the extension 
process we have 

$$\sigma(\tau,x,\Gamma) < \rho(\tau,x,\Gamma)$$

then we will not attempt to control the intersection of any of $S(i,\tau,x,\Gamma)$ with the $\eta m$. Instead 
we will control only $S(i,\tau,x,\Gamma) \cup R(i,\tau,x,\Gamma)$, which has cardinality $(\sigma(\tau,x,\Gamma)+\rho(\tau,x,\Gamma))m$. We 
also use the notation 

$$r(\tau,x,\Gamma) = |R(i,\tau,x,\Gamma)|.$$

If the intersection of the $\eta m$ with $S(i,\tau,x,\Gamma)$ was excessively large, then any excess we will 
relabel as residual progressions for the receiving type. Thus we can guarantee that none of 
the regular flows (i.e., flows of regular progressions) will be excessive, by allowing 
sometimes the residual flows to be excessively large (by at most the same relative error). 
The quantitative argument will be given in the proof of Theorem 3.8.1. Figure 3.8.1 attempts 
to summarize this camouflaging with the residual progressions. 
Figure 3.8.1. CAMOUFLAGING WITH THE RESIDUAL PROGRESSIONS. Case 1: |R| < |S|. 
Case 2: |R| > |S|. 

DEFINITION 3.8.1. The generic variable $\rho$ will stand for any of $\rho(\tau,x,\Gamma)$, the densities of the 
residual progressions. 

As we saw in the last section, it is our goal to perform the extension process so that at each 
step all empty points are nearly equally likely to be filled. Since at each step we introduce 
not one but $\eta m$ points all at once, we have to understand the interactions among the $\eta m$ 
points themselves. It is possible that an initial fragment of the $\eta m$ points will be placed so 
badly that it will greatly affect where the remainder of the $\eta m$ points will go. This, however, 
can only occur if during an insertion one of the $\eta m$ points interacts heavily with those 
previously inserted. 

DEFINITION 3.8.2. Suppose we have a configuration $\Gamma$ of $\gamma m$ occupied positions and are 
inserting $\eta m$ additional points. An insertion of one of these points will be called bad if its 
probe path (i.e., the sequence of examined points before insertion) 

(1) contains an initial segment of length at least $k_\alpha$ consisting of positions of the $\gamma m$ 
and at most one position occupied by one of the $\eta m$ points, 
or 

(2) contains (at least) two of the $\eta m$ points among its first $k_\alpha$ (or fewer) steps. 

An insertion which is not bad will be called good. We let $b_\gamma$ denote the total number of bad 
insertions that have occurred when we reach a load factor of $\gamma$. 
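
The two conditions of the definition translate into a short predicate. The sketch below is one possible reading, not the author's formulation: the probe path is given as the list of occupied cells examined before the empty cell is found, `new_points` is the set of cells filled earlier in the current step, and the function name is hypothetical.

```python
def is_bad_insertion(probe_path, new_points, k_alpha):
    """Classify an insertion as bad under the two conditions of the definition.

    probe_path: occupied cells examined, in order, before the empty cell is reached.
    new_points: cells occupied by points added earlier in the current step.
    """
    head = probe_path[:k_alpha]
    new_hits = sum(1 for cell in head if cell in new_points)
    long_path = len(probe_path) >= k_alpha and new_hits <= 1   # condition (1)
    crowded = new_hits >= 2                                     # condition (2)
    return long_path or crowded

print(is_bad_insertion([3, 8], {8}, 4))            # short path, one new point: good
print(is_bad_insertion([3, 8, 1, 5], set(), 4))    # path of length k_alpha: bad
print(is_bad_insertion([3, 8], {3, 8}, 4))         # two new points early on: bad
```

Under this reading a good insertion has a probe path shorter than the cutoff that meets at most one of the newly added points.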

Figure 3.8.2 illustrates the different cases of good and bad insertions. 

What we will prove below is that the conditional probabilities that any two empty positions 
will be filled, given that they are filled with good insertions, are nearly equal. We introduce 
the quantity $\chi_\gamma$ to capture the relative error in the probabilities (recall $\theta_{\gamma,1}=0$). 

DEFINITION 3.8.3. We let 

$$\chi_\gamma = \sum_{k=2}^{k_\alpha} \gamma^k\,\theta_{\gamma,k}.$$
Figure 3.8.2. THE GOOD AND BAD INSERTIONS 

We now have all the concepts we need to begin the quantitative argument. 



THEOREM 3.8.1. Let $\alpha$, $\beta$ be positive constants such that $0<\beta<\alpha<\alpha_0$. There exist absolute 
positive constants s, D such that, given an arbitrarily small positive constant $\theta$, there exist 
positive constants $\iota_\theta$, $C_\theta$ (tending to 0 as $\theta \to 0$) such that: if we begin the extension 
process with a configuration of $\beta m$ elements placed so that for each point x and type $\tau$ of 
length less than or equal to $k_\alpha$ we have the expected number of arithmetic progressions of 
that type coming to x within a relative error of $\iota_\theta$ and (for those $\tau$ of length at least $k_\beta$) a 
residual error of at most $E_0 = m^{1/4+\delta_2}$ progressions, then, except with probability 
$\exp(-C_\theta m^{\delta_0})$, where $\delta_0$ is a constant, $0<\delta_0<\delta_2-\delta_1$, with $\delta_1$, $\delta_2$ as defined by $\eta$, $k_\beta$ in (i), when we 
reach a load factor $\gamma$ we will have 

(a) $\theta_{\gamma,k} \le (1+\iota_\theta)\,e^{5\theta(\gamma^s/s)k} - 1$ for $2\le k\le k_\alpha$ 

(b) $\chi_\gamma < \theta\gamma^s$ 

(c) $\rho m \le E_0\, m^{(1/2+\delta_3)\log[1+2\log(1/(1-\gamma))]/\log(1/\alpha)}$ 

(d) $b_\gamma \le D\gamma\, m^{1/2-\delta}$ with $\delta$, $0<\delta<\delta_3,\ 2\delta_1$, a constant, 

where $\theta_{\gamma,k}$, $\chi_\gamma$, $\rho$, $b_\gamma$ are as given by Definitions 3.7.2, 3.8.3, 3.8.1, and 3.8.2, respectively. 

Proof. We will prove assertions (a), (b), (c), and (d) by induction on the number of steps in 
the extension process. Thus we will assume that they hold for $\gamma$ and prove them for $\gamma+\eta$. 
For $\gamma=\beta$ all assertions are true trivially, except for (b), which requires that we take 
$\iota_\theta < \theta\beta^{s-2}(1-\beta)$, as we certainly can. We will see how to choose the constants D, s in the 
course of the proof. 

The proof is in two parts. First we examine the effect of the bad insertions, and second we 
look at the propagation of the errors. 

What is the probability of a bad insertion? Let us go back to Definition 3.8.2. An initial 
segment of length at least $k_\alpha$ will be entirely within the $\gamma m$ with probability $\sigma(\tau_0,\gamma)$, where $\tau_0$ 
is the type of $k_\alpha$ hits. Similarly, the probability of encountering one of the $\eta m$ points in this 
segment is certainly bounded by $k_\alpha\sigma(\tau_1,\gamma)$, where $\tau_1$ denotes any type of length $k_\alpha$ and $k_\alpha-1$ 
hits. Thus the probability of condition (1) of the definition being satisfied does not exceed 

(iii) $\gamma^{k_\alpha}(1+\theta_{\gamma,k_\alpha}) + k_\alpha\gamma^{k_\alpha-1}(1-\gamma)(1+\theta_{\gamma,k_\alpha})$. 

We estimate the probability that condition (2) will be satisfied somewhat differently. We ask 
how many pairs (h(K), g(K)) there are that lead to a probe path satisfying (2). The probe 
path is completely specified once we know the two $\eta m$ points involved, and the positions of 
the two points on the path, say they are the b-th and c-th points respectively. Since we can 
take $1\le b<c\le k_\alpha$ we have at most $\frac{1}{2}k_\alpha^2\eta^2m^2$ distinct probe paths. Each candidate pair 
(h(K), g(K)) defines a distinct path. Since each pair occurs with probability 1/m(m-1), we 
have an overall probability (per insertion) of satisfying (2) that is bounded by $k_\alpha^2\eta^2$. 



From assertion (a) we have 

$$\theta_{\gamma,k} \le (1+\iota_\theta)\,e^{5\theta(\gamma^s/s)k} - 1,$$

and since 

$$k_\alpha = [(1/2+\delta_3)\log m]/[\log 1/\alpha]$$

it follows that 

$$e^{5\theta(\gamma^s/s)k_\alpha} = m^{5\theta(\gamma^s/s)(1/2+\delta_3)/\log(1/\alpha)},$$

which can be made $< m^{\delta_5}$ by taking $\theta$ sufficiently small, where $\delta_5$ is such that $0 < \delta_5 < \delta_3-\delta,\ 
2\delta_1-\delta$. Thus the quantity specified in (iii) is less than or equal to $m^{-1/2-\delta'}$ for some $\delta'>\delta$, 
and so is $k_\alpha^2\eta^2$, as the reader can easily check. The residual progressions of the types 
accounted for in (iii) have to be added in also, of course, but their number as given by (c), 
divided by m, is less than or equal to 

$$\frac{m^{1/4+\delta_2}\; m^{(1/2+\delta_3)\log[1+2\log(1/(1-\gamma))]/\log(1/\alpha)}}{m},$$

which is less than $m^{-1/2-\delta}$ for $\gamma<\alpha<\alpha_0$, as can be easily checked. We will encounter this 
$\alpha_0$ later also, so we will not dwell on its value any longer here. Thus we have proved 

Claim 1. The probability of a bad insertion is never greater than $m^{-1/2-\delta}$ for $\beta\le\gamma<\alpha<\alpha_0$. 



By Theorem 3.2.1 (or its equivalent for Bernoulli trials) the probability that we will have more 
than $D m^{-1/2-\delta}\,\eta m = D\eta m^{1/2-\delta}$ bad insertions, for some constant D slightly larger than 1, is 

less than exp(-C(D-1) tjm ) < exp(-m '') and thus this event can be excluded. 

Therefore we can assert that at load factor $\gamma+\eta$ the total of bad insertions will not exceed 

$$D(\gamma+\eta)\,m^{1/2-\delta},$$

as we desire in order to prove (d). Although we cannot say anything about where the badly 
inserted points will go, their number is so small that, as we shall see, they cannot destroy 
the final assertion of regularity of our configuration. 



We next show that any two empty positions have nearly equal probabilities of being filled 
with good insertions. Under double hashing the probability that a given empty position will 
be filled is proportional to the number of arithmetic progressions coming to that position 
from the occupied positions. Recall also that in a good insertion, the probe path is at most $k_\alpha$ 
long and in this path at most one of the new $\eta m$ points can occur. Let x be any empty point. 
The number of regular (i.e., non-residual) arithmetic progressions of length k, $0\le k\le k_\alpha$, 
coming to x from the occupied points is by assumption $\gamma^k(1\pm\theta_{\gamma,k})m$, for a total of 

$$\sum_{k=0}^{k_\alpha} \gamma^k(1\pm\theta_{\gamma,k})\,m,$$

or the discrepancy over the expected value is in absolute value at most 

$$\Big(\sum_{k=0}^{k_\alpha} \gamma^k\theta_{\gamma,k}\Big)\,m = \chi_\gamma m.$$

Each of the new $\eta m$ points can occur in the path, and each such point can introduce at most 
k additional progressions of length k coming to x (by being the 1st, 2nd, ..., k-th point of the 
progression), for a total of 

(iv) $\eta m \sum_{k=0}^{k_\alpha} k \le k_\alpha^2\,\eta m$. 

The number of residual progressions coming to x is at most 

(v) $\sum_{k\le k_\alpha} \rho m \le k_\alpha m^{1/2-\delta}$ 

for $\alpha<\alpha_0$. Finally the previously badly inserted points can each introduce at most k 
progressions of length k, for a total as in (iv) of at most 

(vi) $b_\gamma \sum_{k=0}^{k_\alpha} k \le k_\alpha^2\, m^{1/2-\delta}$ 

progressions. We only demand in (b) that $\chi_\gamma$ can be made as small as any prescribed 
constant, and so the combined effect of (iv), (v), and (vi) can be accounted for by asserting 
that the deviation of the number of the arithmetic progressions coming to x from the 
expected value does not exceed $2\chi_\gamma m$. Thus we have proved 



Claim 2. The probability that at a certain moment any specified empty point will be filled 
with a good insertion during the $\gamma$ to $\gamma+\eta$ step is $(1\pm2\chi_\gamma)/((1-\gamma)m)$, independently of where 
any previously inserted elements among the $\eta m$ were located. 

(We have written $2\chi_\gamma$ instead of $2(1-\gamma)\chi_\gamma$ so as to incorporate the error that the $(1-\gamma)$ in the 
denominator can really vary between $(1-\gamma)$ and $(1-\gamma-\eta)$.) The unavoidable bad insertions 
and the above small deviation from randomness are the way that clustering manifests itself in 
this argument. When we insert the new $\eta m$ points, the probability that an empty position will 
be filled is 

(vii) $\eta^* = \eta(1\pm2\chi_\gamma)/(1-\gamma)$. 

This ignores the effect of the bad insertions, but their contribution can easily be incorporated 
in the over-generous factor of 2 introduced above, since their number is $O(m^{1/2-\delta})$, which is 
much less than $\eta m = m^{3/4-\delta_1}$. 



We are now ready to begin excluding those choices of the $\eta m$ points that would cause any 
of the inter-type lattice flows to be excessively different from the expected value. We do 
this simultaneously for the lattices corresponding to all k, $2\le k\le k_\alpha$ (k=0,1 cannot vary from 
the average) and all points x. We control the flows by allowing a maximum relative error of 
$\theta_0$ for the intersections of our $\eta m$ points with each of the sets $S(i,\tau,x,\Gamma)$ of Definition 3.7.1, 
where $\Gamma$ denotes our current configuration of $\gamma m$ occupied positions. By Theorem 3.2.1 we 
can do this while excluding only an exponentially small fraction of the choices of the $\eta m$ 
points as long as the expected size of the intersection is not too small. At the beginning of 
this section we introduced the residual progressions as a device for handling the small 
$S(i,\tau,x,\Gamma)$. For each i, $\tau$, and x we demand that the intersections of both $S(i,\tau,x,\Gamma)$ and 
$R(i,\tau,x,\Gamma)$ with the $\eta m$ are within $(1\pm\theta_0)$ of what we expect if $|S(i,\tau,x,\Gamma)| > |R(i,\tau,x,\Gamma)|$; 
otherwise we only demand this of the (disjoint) union $S(i,\tau,x,\Gamma) \cup R(i,\tau,x,\Gamma)$. In the latter 
case the intersection will have up to $(\sigma(\tau,x,\Gamma)+\rho(\tau,x,\Gamma))\,\eta^*(1+\theta_0)m$ points. By relabelling 
some regular progressions as residual we can then still claim that the flow corresponding to 
the intersection of $S(i,\tau,x,\Gamma)$ with the $\eta m$ is $\eta^*\sigma(\tau,x,\Gamma)(1\pm\theta_0)m$, provided we allow the flow 
corresponding to $R(i,\tau,x,\Gamma)$ to get as large as $\rho(\tau,x,\Gamma)\,\eta^*(1+\theta_0)m$. Furthermore now no set 
whose intersection with the $\eta m$ we desire to control has cardinality less than $E_0 = m^{1/4+\delta_2}$. 
Theorem 3.2.1 then implies 

Claim 3. During the step from load factor $\gamma$ to load factor $\gamma+\eta$, if we exclude a fraction of 
choices of the $\eta m$ points that does not exceed $\exp(-C_\theta m^{\delta_4})$ (for $\delta_4$ as in (ii)), then we can 
assume that the intersection of the $\eta m$ points with each of $S(i,\tau,x,\Gamma)$ ($\tau$ a type of length at 
most $k_\alpha$) will have cardinality $\sigma(\tau,x,\Gamma)\,\eta^*(1\pm\theta_0)m$, and the intersection of the $\eta m$ with each 
of $R(i,\tau,x,\Gamma)$ will not be larger than $\rho(\tau,x,\Gamma)\,\eta^*(1+\theta_0)m$. 



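We now compute the relative error $\theta_{\gamma+\eta,k}$ in terms of $\theta_{\gamma,k}$. Let $\tau$, the type we are now 
considering, have length k and $l$ hits. We saw in Section 3.7 that in order to maximize the 
relative error for the type of k hits, we may assume that all inter-type flows are maximal. As 
we will see momentarily, we can ignore any flows caused by progressions hit by more than 
one of the $\eta m$ points. Thus the maximal number of progressions of type $\tau$ we can have at 
any point when we reach load factor $(\gamma+\eta)$ is 

$$\big[\gamma^l(1-\gamma)^{k-l}(1+\theta_{\gamma,k}) + l\gamma^{l-1}(1-\gamma)^{k-l+1}(1+\theta_{\gamma,k})\,\eta^*(1+\theta_0) - (k-l)\gamma^l(1-\gamma)^{k-l}(1+\theta_{\gamma,k})\,\eta^*(1+\theta_0)\big]\,m,$$

where the three terms are, respectively, the progressions already there, the inflow, and the 
outflow (with "+" in the factors $(1+\theta_0)$ since all flows are maximal). 

Figure 3.8.3 illustrates the inflow and outflow of progressions from a type. 

Ignoring the factor of m we can write the above expression as 

$$\gamma^l(1-\gamma)^{k-l}(1+\theta_{\gamma,k}) + l\gamma^{l-1}(1-\gamma)^{k-l+1}(1+\theta_{\gamma,k})(1+\theta_0)\,\eta(1+2\chi_\gamma)/(1-\gamma) - (k-l)\gamma^l(1-\gamma)^{k-l}(1+\theta_{\gamma,k})(1+\theta_0)\,\eta(1+2\chi_\gamma)/(1-\gamma).$$

This has to equal $(\gamma+\eta)^l(1-\gamma-\eta)^{k-l}(1+\theta_{\gamma+\eta,k})$, and so we get 

$$1+\theta_{\gamma+\eta,k} = \frac{\gamma^l(1-\gamma)^{k-l}}{(\gamma+\eta)^l(1-\gamma-\eta)^{k-l}}\Big[(1+\theta_{\gamma,k}) + (l\eta/\gamma)(1+\theta_{\gamma,k})(1+2\chi_\gamma)(1+\theta_0) - [(k-l)\eta/(1-\gamma)](1+\theta_{\gamma,k})(1+2\chi_\gamma)(1+\theta_0)\Big].$$

Now 

$$\frac{\gamma^l(1-\gamma)^{k-l}}{(\gamma+\eta)^l(1-\gamma-\eta)^{k-l}} = 1 - l\eta/\gamma + (k-l)\eta/(1-\gamma) + O(\eta^2).$$

If we ignore the $\eta^2$ terms, then we can rewrite the above as 

(viii) 
$$\theta_{\gamma+\eta,k} = \theta_{\gamma,k} + (\eta l/\gamma)(\theta_0+2\chi_\gamma+2\theta_0\chi_\gamma)\theta_{\gamma,k} + (\eta l/\gamma)(\theta_0+2\chi_\gamma+2\theta_0\chi_\gamma)$$
$$\qquad - [\eta(k-l)/(1-\gamma)](\theta_0+2\chi_\gamma+2\theta_0\chi_\gamma)\theta_{\gamma,k} - [\eta(k-l)/(1-\gamma)](\theta_0+2\chi_\gamma+2\theta_0\chi_\gamma).$$

The effect of the $\eta^2$ terms can be incorporated in the constants $\theta_0$ or $\chi_\gamma$, and so these terms 
can be justifiably ignored. We maximize the error $\theta_{\gamma+\eta,k}$ by taking $l=k$ above, so our final 
error propagation equation becomes 

(ix) $\theta_{\gamma+\eta,k} = \theta_{\gamma,k} + (\eta k/\gamma)(\theta_0+2\chi_\gamma+2\theta_0\chi_\gamma)\theta_{\gamma,k} + (\eta k/\gamma)(\theta_0+2\chi_\gamma+2\theta_0\chi_\gamma)$. 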
Figure 3.8.3. THE INFLOW AND OUTFLOW OF PROGRESSIONS FROM A TYPE 

Now we have $\chi_\gamma<\theta\gamma^s$ and we take $\theta_0<\theta\gamma^s$, where s is a constant to be chosen below. For 
$\theta$ sufficiently small we will have $\theta_0<1$ and so we can make the errors $\theta_{\gamma,k}$ only larger by 
writing 

(x) $\theta_{\gamma+\eta,k} = \theta_{\gamma,k} + 5\theta\gamma^{s-1}k\eta\,(1+\theta_{\gamma,k})$. 

Going back through the above derivation and changing the signs of $\theta_{\gamma+\eta,k}$, $\theta_{\gamma,k}$, $\theta_0$ and $\chi_\gamma$ 
gives us the error propagation equation for the negative errors. (Now we want all flows to 
be minimal.) We get the equivalent of (viii) for the absolute value of the error: 

$$\theta_{\gamma+\eta,k} = \theta_{\gamma,k} - (\eta l/\gamma)(\theta_0+2\chi_\gamma-2\theta_0\chi_\gamma)\theta_{\gamma,k} + (\eta l/\gamma)(\theta_0+2\chi_\gamma-2\theta_0\chi_\gamma)$$
$$\qquad + [\eta(k-l)/(1-\gamma)](\theta_0+2\chi_\gamma-2\theta_0\chi_\gamma)\theta_{\gamma,k} - [\eta(k-l)/(1-\gamma)](\theta_0+2\chi_\gamma-2\theta_0\chi_\gamma).$$

Since $\gamma<\alpha_0$, which we can take less than 1/2, we have $\gamma<1-\gamma$, and so we can conclude that 
(ix) is valid for the absolute value of the negative errors as well. Thus equation (x) is 
justified for the absolute value of both positive and negative errors. 



We still have to estimate the size of the flows that involve more than one of the $\eta m$ points. 
Let us look at an arithmetic progression of length k coming to x that changes type by 
receiving two of the $\eta m$ points. Suppose these two points occupy positions i and j of the 
progression respectively, $1\le i<j\le k$. Let us fix the two types involved, which fixes i and j, 
and then ask how many progressions can flow between these two types. If the donating 
type is $\tau$, then at most one such progression can flow for each pair (a,b) of the $\eta m$ points 
with the property that $a \in S(i,\tau,x,\Gamma)$ and a and b are in a distance ratio i:j from x. If we allow 
the $\eta m$ points to range over all m points, not just the remaining $(1-\gamma)m$ ones, we can only 
increase the flow in question. But now we are exactly in the situation covered by Theorem 
3.4.4, and thus we can assert that our flow, except with exponentially small probability, will 
be 0(r^|S{i,T,x,r)|m ). Summing over all possible choices of i and j we still get a total 
possible inflow into the receiving type of 0(k^ Y (''"Y) V^ ) only, where / denotes 

the number of hits of the receiving type $\tau'$. A trivial extension of the above argument shows that any flows into $\tau'$ arising from types with 3 (or 4, etc.) fewer hits will not be any greater.

Thus the total magnitude of these flows combined will be $o(\gamma^l(1-\gamma)^{k-l}\eta m)$ and can therefore be incorporated into the relative errors permitted in equation (ix). Our derivation of equation (x) by ignoring type transitions which involve more than one of the $\eta m$ points has been justified.



We are finally at the point where we can push assertion (a) of our theorem through the induction step. We would like to prove

$\theta_{\gamma+\eta,k} \le (1+\psi)e^{5\theta[(\gamma+\eta)^s/s]k} - 1,$

and since $\eta$ can be taken arbitrarily small, this is

$(1+\psi)e^{5\theta(\gamma^s/s)k}(1+5\theta\gamma^{s-1}k\eta+O(k\eta^2)) - 1$

$= (1+\theta_{\gamma,k})(1+5\theta\gamma^{s-1}k\eta+O(k\eta^2)) - 1.$

Again the $O(k\eta^2)$ term is negligible compared to the $5\theta\gamma^{s-1}k\eta$ terms, and can be incorporated in the constant $\theta$. Thus we need

$\theta_{\gamma+\eta,k} = \theta_{\gamma,k} + 5\theta\gamma^{s-1}k\eta(1+\theta_{\gamma,k}),$
which is exactly what we have proved in (x).

Next we prove (b) and determine the constant $s$. It will be simpler to let the sum

$\sum_{k=2}^{k_\alpha} \theta_{\gamma,k}\gamma^k$

go to infinity, which we are allowed to do, since this can only increase the bound for $\chi_\gamma$. So

$\chi_\gamma \le \sum_{k=2}^{\infty} \theta_{\gamma,k}\gamma^k;$

then substituting $\theta_{\gamma,k}$ from (a) and letting

$A = 5\theta(\gamma^s/s)$

we get

$\chi_\gamma \le (1+\psi)\gamma^2e^{2A}/(1-\gamma e^A) - \gamma^2/(1-\gamma)$

$= \psi\gamma^2e^{2A}/(1-\gamma e^A) + (e^A-1)\gamma^2(1+e^A-\gamma e^A)/(1-\gamma)(1-\gamma e^A).$

For small $\theta$ we have $e^A \approx 1$, $e^A-1 \approx 5\theta(\gamma^s/s)$, and so if we take $s$ so large that

$(5/s)(2\alpha^2/(1-\alpha)^2) < 1/2,$
and $\psi$ so small that



'6 



\0 < ({^-a)/2a^)^^i 



then we will have

$\chi_\gamma < \theta\gamma^s$

as desired. Note that $s$ is independent of $\theta$ and $\gamma$.

The last object of interest is the residual sets. How fast can they grow? Let $r_\gamma^{(k,l)}$ denote $\rho(\tau,\gamma)m$ for $\tau$ a type of length $k$ and $l$ hits. The change in the $r$'s as we go from $\gamma$ to $\gamma+\eta$ load factor can be computed in a manner analogous to the above. The maximum flow into $\tau$ during the current step will be

$2l\eta\, r_\gamma^{(k,l-1)}/(1-\gamma).$

We are not counting the outflows, so we obtain the recurrence relation

(xi)  $r_{\gamma+\eta}^{(k,l)} = r_\gamma^{(k,l)} + 2l\eta\, r_\gamma^{(k,l-1)}/(1-\gamma), \quad 0<l\le k.$

(Here we see that types with more hits will grow faster than types with fewer, since they 
have more types feeding into them. The same, of course, was true in our computation of the 
relative errors, but there we decided to ignore this improvement. Because the residual sets 
cause the argument to fail for large a, we want to do a better job of estimating their growth.) 

Since all initial values are identical, it is clear from (xi) that $r_\gamma^{(k_\alpha,k_\alpha)}$ is the maximally growing type. In what follows therefore we restrict ourselves to estimating its growth. Again, since $\eta$ is infinitesimal compared to $l$ or $r$, we can easily check that the solution to the above difference equations (which we can think of as the system of differential equations

$dr_\gamma^{(k,l)}/d\gamma = 2l\, r_\gamma^{(k,l-1)}/(1-\gamma), \quad 0<l\le k$)

is of the form

$r_\gamma^{(k,l)} = E_0\,(1 + 2\log(1/(1-\gamma)) - 2\log(1/(1-\beta)))^l$
$\le E_0\,(1 + 2\log(1/(1-\gamma)))^l.$
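
The closed form can be checked numerically against the difference equations. The following is a minimal sketch, not part of the original derivation: it assumes the recurrence $r_{\gamma+\eta}^{(k,l)} = r_\gamma^{(k,l)} + 2l\eta\, r_\gamma^{(k,l-1)}/(1-\gamma)$ with identical initial values $E_0$ at load factor $\beta$, and the function names and the sample values of $\beta$, $\gamma$, $k$ and $\eta$ are illustrative choices.

```python
import math

def euler_residuals(e0, beta, gamma_end, k, eta):
    """Iterate r[l] <- r[l] + 2*l*eta*r[l-1]/(1-g) from load factor beta to gamma_end."""
    r = [e0] * (k + 1)                 # identical initial values for every hit count l
    steps = round((gamma_end - beta) / eta)
    g = beta
    for _ in range(steps):
        prev = r[:]                    # use values from the start of the step only
        for l in range(1, k + 1):
            r[l] = prev[l] + 2 * l * eta * prev[l - 1] / (1 - g)
        g += eta
    return r

def closed_form(e0, beta, gamma, l):
    """E0 * (1 + 2 log(1/(1-gamma)) - 2 log(1/(1-beta)))^l."""
    u = 2 * math.log(1 / (1 - gamma)) - 2 * math.log(1 / (1 - beta))
    return e0 * (1 + u) ** l

approx = euler_residuals(1.0, 0.05, 0.30, 5, 1e-4)
exact = [closed_form(1.0, 0.05, 0.30, l) for l in range(6)]
for l, (a, e) in enumerate(zip(approx, exact)):
    print(l, round(a, 4), round(e, 4))
```

With a small step $\eta$ the two columns agree closely, and the values grow with the number of hits $l$, which is why the type with the most hits dominates.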



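Thus

$r_\gamma^{(k_\alpha,k_\alpha)} \le E_0\,[1 + 2\log(1/(1-\gamma))]^{(1/2+\delta_3)\log m/\log(1/\alpha)}$

$= E_0\, m^{(1/2+\delta_3)\log[1+2\log(1/(1-\gamma))]/\log(1/\alpha)}$

and so

$\rho(\tau,\gamma) \le E_0\, m^{(1/2+\delta_3)\log[1+2\log(1/(1-\gamma))]/\log(1/\alpha)-1}$

as (c) of Theorem 3.8.1 requires.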



There are at most I/tj = m * ^ steps, at most m points x, at most 22<|^<[^ 2 « < 2 « 



-H'^-i'^a 



= 2m * 3)°9 °9^ ^> distinct types and at most k^ values for i in the context S(i,T,x,r) 
Thus the total number of excluded cases is a polynomial in m, and no case has probability 
higher than exp(-C^m ^). Thus as m gets large the total of the probabilities of the excluded
cases is less than exp{-C^m O), where 5q is as constrained in (ii). 

This completes our proof of Theorem 3.8.1. I 



COROLLARY 3.8.1. Given $0<\beta<\alpha<\alpha_0<1$, for any small positive constant $\theta$, there exists an initial configuration of $\beta m$ occupied points, such that if we add $(\alpha-\beta)m$ additional points using the double hashing process, then except with probability $\exp(-Cm^{\delta_0})$, we will arrive at a configuration of $\alpha m$ occupied positions such that for each point $x$ and for each length $k$, $2\le k\le k'_\alpha$, the number of arithmetic progressions coming to $x$ of length $k$ from the occupied points will be $\alpha^k m(1\pm\theta_{\alpha,k})$. Here the $\theta_{\alpha,k}$ are relative errors satisfying

(a) $\sum_{k=2}^{k'_\alpha} |\theta_{\alpha,k}|\alpha^k < \theta$, and

(b) $\alpha^{k'_\alpha}(1\pm\theta_{\alpha,k'_\alpha})m < m^{1/2-\delta'}$,

for some positive constants $\delta_0$, $\delta'$. (Notice that we have excluded any references to bad insertions or residual progressions.)



Proof. This is a direct consequence of Theorem 3.8.1, except for a few items that we need to check. First is the existence of a good initial configuration, the "seed" of the extension process. We have to choose a configuration of $\beta m$ points so that for each point $x$ and each type $\tau$ of length $k$ and $l$ hits we have

progressions of type t coming to x, with i^, k, / restricted as in the theorem. Now if 

fi{^-^) m>m , then by Theorem 3.4.2., all configurations except for a fraction not 
exceeding exp(-Ci^ m ) of them will satisfy the above condition. If /?(1-/S) "m < m 

then let t^ denote a shorter type which is an initial segment of t, of length k^ and /^ hits, 

such that p^{^-p) "■ "i m ~ m . Then we can apply Corollary 3.4.2 to t^ and claim it has 

at most $O(m^{1/4})$ progressions. Clearly $\tau$ cannot have more progressions than $\tau_1$, so the excess of progressions $\tau$ can have is at most $O(m^{1/4})$. Any such excessive progressions we label as residual for our type $\tau$. This is consistent with the assumptions of Theorem 3.8.1 that allow initial residual errors as large as $m^{1/4+\delta_2}$ per type. Therefore all configurations of the $\beta m$ points except for an exponentially small fraction of them satisfy the initial conditions of Theorem 3.8.1. We start the extension process by choosing one of them. This is an interesting "non-constructive" aspect of our proof. We do not know how to find a specific such good configuration, though we have just proved that almost all configurations are good.



We now perform the extension process till we reach the load factor a, as described in 
Theorem 3.8.1. The number of points inserted with bad insertions is 0(m ). For each 
point X and each length k, no bad point can introduce (influence) more than k progressions 

A / — K 

of length k coming to x. Thus the bad points can introduce at most 0(km ) progressions 
of length k coming to x, k<k^=0(log m). The number of residual progressions of length k 

coming to $x$ is at most

$E_0\, m^{(1/2+\delta_3)\log[1+2\log(1/(1-\alpha))]/\log(1/\alpha)} = m^{1/4+\delta_2+(1/2+\delta_3)\log[1+2\log(1/(1-\alpha))]/\log(1/\alpha)}.$

The absolute constant $\alpha_0$ is chosen so that

$1/4 + \delta_2 + (1/2+\delta_3)\log[1+2\log(1/(1-\alpha))]/\log(1/\alpha) < 1/2-\delta$

for $\alpha<\alpha_0$. A rough numerical computation shows that
$\alpha_0 \approx .319.$
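
The rough numerical computation can be reproduced as follows. This is a minimal sketch under the assumption that all the small constants $\delta$, $\delta_2$, $\delta_3$ are set to zero, so that the condition becomes $1/4 + (1/2)\log[1+2\log(1/(1-\alpha))]/\log(1/\alpha) < 1/2$; the function names and the use of bisection are illustrative choices, not taken from the text.

```python
import math

def exponent(alpha, d2=0.0, d3=0.0):
    """Left-hand side 1/4 + d2 + (1/2 + d3) * log[1 + 2 log(1/(1-alpha))] / log(1/alpha)."""
    growth = math.log(1 + 2 * math.log(1 / (1 - alpha)))
    return 0.25 + d2 + (0.5 + d3) * growth / math.log(1 / alpha)

def alpha_zero(delta=0.0, d2=0.0, d3=0.0, tol=1e-10):
    """Largest load factor with exponent(alpha) < 1/2 - delta, found by bisection."""
    lo, hi = 0.01, 0.99               # exponent is increasing in alpha on this range
    target = 0.5 - delta
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if exponent(mid, d2, d3) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(alpha_zero())
```

Taking positive values for the deltas lowers the threshold slightly, so the limiting value computed here is an upper bound for the admissible $\alpha_0$.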

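Thus the contribution of the residual progressions at any length is at most $O(m^{1/2-\delta})$. Let now $\delta'$, $\delta''$ be such that $0<\delta'<\delta''<\delta,\delta_3$. Let $k'_\alpha < k_\alpha$ be such that

$\alpha^{k'_\alpha}m = m^{1/2-\delta''};$

such a $k'_\alpha$ clearly exists since

$\alpha^{k_\alpha}m = m^{1/2-\delta_3}.$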



We have

$1+\theta_{\alpha,k'_\alpha} \le (1+\psi)e^{5\theta(\alpha^s/s)k'_\alpha}.$

Recall that $\psi\to 0$ as $\theta\to 0$, and so by choosing $\theta$ sufficiently small we can obtain

$1+\theta_{\alpha,k'_\alpha} < m^{\delta'''}$, for $\delta'''<\delta''-\delta'$.
So we have

$\alpha^{k'_\alpha}m(1+\theta_{\alpha,k'_\alpha}) < m^{1/2-\delta''+\delta'''}.$

We can add to this the contribution of the bad insertions and the residual progressions, and since they both are $O(m^{1/2-\delta''})$, the grand total of progressions of length $k'_\alpha$ coming to $x$ is

$\alpha^{k'_\alpha}m(1+\theta_{\alpha,k'_\alpha}) < m^{1/2-\delta'}.$

This proves part (b) of the Corollary.

For part (a) we work analogously. We know that 



(xii) 2 ^a,k^^ ^ ^"^ ^ ^• 



k=2 

The contributions of the bad insertions and the residual progressions estimated as above are 
0{m ) even when summed over all allowed lengths k. Thus these contributions to (xii) 

can be incorporated in the constant $\theta$. Since $k'_\alpha<k_\alpha$, we have shown that the true relative errors satisfy

$\sum_{k=2}^{k'_\alpha} |\theta_{\alpha,k}|\alpha^k < \theta$

as desired.

By Theorem 3.8.1 the probability of the excluded events is $\exp(-Cm^{\delta_0})$. This completes the argument. ∎



Remark. It is worth pointing out the reason why we have carried out the computation of the growth of the residual progressions separately from the regular ones. For the regular progressions, the initial number of progressions of a type $\tau$ of length $k$ and $l$ hits is approximately $\beta^l(1-\beta)^{k-l}m$. Thus for $\beta$ small and a particular $k$ we have most regular progressions in types with few hits. The initial number of residual progressions, however, is the same for all types, thus giving rise to a quantitatively different model.

Figure 3.8.4 is to be used for reference. It summarizes the various $\delta$'s we have introduced and the relations among them.

DEFINITIONS.



P^Pm 



a'^«m 



= m 

= m 

= m 

= m 



1/4-5. 



1/4+5. 



1/2-53 



p m''/4+52 



prb. of excluded events = e ^ 
Prb. of bad insertion = m 



CONSTRAINTS. 



52 > 5^ > 



52-5^ > 54 > 5o > 



25^, 53 > 5g > 5 > 
25^-5, 53-5 > 5^ > 



53, 5 > 5" > 5' > 



Figure 3.8.4. THE PROLIFERATION OF DELTAS

Appendix

On the Distribution of Arithmetic Progressions of

Length 3 in a Random Sample of Integers 

1,2,.. .,N — A Derivation Using the 

Exponential Sums Technique 



We consider subsets $S$ of $[1,N]$ generated by the following random process: each $x\in[1,N]$ is chosen to belong to $S$ with a fixed probability $\lambda$, $0<\lambda<1$, independently of all the other $x$.

[Caveat: We have chosen this probability model versus the one in which we regard all $S$ with $|S| = \lambda N$ as equally likely because the manipulations are easier; almost certainly the identical derivation holds for this second model -- places in the argument where the two models differ are indicated by †.]

If we consider the exponential sum ($\alpha\in\mathbb{R}$)

$S(\alpha) = \sum_{x\in S} e(\alpha x)$, (we write $e(t)$ for $e^{2\pi i t}$),

then it is easy to see that the number of arithmetic progressions of length 3 in $S$ is given by

$\int_0^1 S^2(\alpha)S(-2\alpha)\,d\alpha.$

If we set $T(\alpha) = \lambda \sum_{x=1}^{N} e(\alpha x)$, then it is again easy to see that the average number of arithmetic progressions of length 3 of a set $S$ generated as above is

($)  $\int_0^1 T^2(\alpha)T(-2\alpha)\,d\alpha = \lambda^3N^2/4 + O(N)$ as $N\to\infty$.

The constants implied in all our uses of the O notation will be absolute.
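
The identity behind the exponential sum can be illustrated on a small random sample. The sketch below is an illustration under stated assumptions rather than part of the derivation: it takes the integral literally, so it counts ordered pairs $(x,y)$ of elements of $S$ whose midpoint $(x+y)/2$ also lies in $S$, with the degenerate case $x=y$ included. The integral is evaluated exactly as an average over $M=4N$ equally spaced points, which is valid because every frequency $x+y-2z$ has absolute value below $M$. The function names and the sample parameters are hypothetical.

```python
import cmath
import random

def e(t):
    """e(t) = exp(2*pi*i*t)."""
    return cmath.exp(2j * cmath.pi * t)

def count_direct(S):
    """Number of ordered pairs (x, y) in S x S whose midpoint is also in S."""
    members = set(S)
    return sum(1 for x in S for y in S
               if (x + y) % 2 == 0 and (x + y) // 2 in members)

def count_by_integral(S, N):
    """Integral over [0,1] of S(a)^2 * S(-2a), as an exact average over M = 4N points."""
    M = 4 * N
    total = 0
    for j in range(M):
        a = j / M
        s1 = sum(e(a * x) for x in S)        # S(a)
        s2 = sum(e(-2 * a * x) for x in S)   # S(-2a)
        total += s1 * s1 * s2
    return (total / M).real

random.seed(1)
N, lam = 40, 0.5
S = [x for x in range(1, N + 1) if random.random() < lam]
print(count_direct(S), round(count_by_integral(S, N)))
```

Under this ordered convention each nondegenerate progression is counted twice, once for each direction of traversal.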



A set $S$ generated by the above process will be called $(m,\epsilon)$-equidistributed, $\epsilon < \min\{\lambda,1-\lambda\}$, $m<N$, if for all $q$, $1\le q\le N^{1/2}$, and for all $n$, $1\le n\le N-mq$, the number of elements of $S$ that lie on the arithmetic progression $n, n+q, n+2q, \ldots, n+(m-1)q$ is not more than $(\lambda+\epsilon)m$, and furthermore $|S|\ge(\lambda-\epsilon)N$.
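
The definition translates directly into a brute-force check. The following sketch assumes the reading of the definition given above (differences $q$ up to $\sqrt{N}$, starting points $n$ up to $N-mq$); the function name and the two small test sets are hypothetical examples chosen for illustration.

```python
import math

def is_equidistributed(S, N, m, eps, lam):
    """Brute-force test of (m, eps)-equidistribution for a subset S of [1, N]."""
    members = set(S)
    if len(members) < (lam - eps) * N:      # S must not be too small
        return False
    for q in range(1, math.isqrt(N) + 1):   # common differences 1 <= q <= sqrt(N)
        for n in range(1, N - m * q + 1):   # starting points 1 <= n <= N - m*q
            hits = sum(1 for i in range(m) if n + i * q in members)
            if hits > (lam + eps) * m:      # too many elements on one progression
                return False
    return True

evens = list(range(2, 101, 2))
sparse = list(range(1, 9)) + [20, 40, 60, 80, 100]
print(is_equidistributed(evens, 100, 10, 0.1, 0.5))
print(is_equidistributed(sparse, 100, 10, 0.375, 0.5))
```

The set of even numbers fails because the progression 2, 4, ..., 20 lies entirely inside it, which is the kind of arithmetic structure the definition is designed to exclude.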
In order to study the deviations of the number of arithmetic progressions of length 3 from the average, we need the following three fundamental lemmas.

The first lemma is just a summary of trivial but useful properties.

LEMMA 1. We have, with $S$, $T$ as above

(1) $S(\alpha)$, $T(\alpha)$ are periodic of period 1;

(2) $S(\alpha) = O(\lambda N)$, $T(\alpha) = O(\lambda N)$ as $N\to\infty$;

(3) $\int_0^1 |S(\alpha)|^2\,d\alpha = \lambda N$;

(4) $T(\alpha) = O(\lambda/\|\alpha\|)$, as $\|\alpha\|\to 0$, where $\|\alpha\|$ denotes the distance from $\alpha$ to the nearest integer. ∎

The next lemma asserts that for equidistributed sets $S$, $S(\alpha)$ is well approximated by $T(\alpha)$.

LEMMA 2. If a set $S$ generated by the above process is $(m,\epsilon)$-equidistributed, then for $m = o(N^{1/2})$

(5) $|S(\alpha)-T(\alpha)| < 3\epsilon N+O(mN^{1/2})$. ∎

The last lemma shows that non-equidistributed sets are extremely rare.

LEMMA 3. The probability that a set $S$ generated by the above process is not $(m,\epsilon)$-equidistributed is

$O(N^{3/2}e^{-\epsilon^2 m})$, as $N,m\to\infty$. ∎

Using those three lemmas we can then prove our main result, which is

THEOREM. With probability $1-O(N^{3/2}e^{-\epsilon^2 m})$ we have



I j*o's^(«)S(-2a)da - j"oV^(a)T{-2a)da| < 



3eXN^ + 0{AmN^^^ + X^(A+e)N^^^), as N,m -* oo. 
Lemma 1 is obvious except possibly for part (4). This is easily done by summing the
geometric series involved.
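
Summing the geometric series gives $T(\alpha) = \lambda\, z(z^N-1)/(z-1)$ with $z = e(\alpha)$, so that $|T(\alpha)| \le \lambda/|\sin \pi\alpha| \le \lambda/(2\|\alpha\|)$. The sketch below checks this numerically; the explicit constant $1/2$, the function names and the sample values of $\alpha$, $N$ and $\lambda$ are assumptions made for the illustration and do not appear in the text.

```python
import cmath
import math

def T(alpha, N, lam):
    """T(alpha) = lam * sum_{x=1}^{N} e(alpha * x), summed term by term."""
    return lam * sum(cmath.exp(2j * math.pi * alpha * x) for x in range(1, N + 1))

def T_closed(alpha, N, lam):
    """Closed form from the geometric series; requires alpha not an integer."""
    z = cmath.exp(2j * math.pi * alpha)
    return lam * z * (z ** N - 1) / (z - 1)

def dist_to_nearest_int(alpha):
    """||alpha||, the distance from alpha to the nearest integer."""
    return abs(alpha - round(alpha))

N, lam = 200, 0.3
for alpha in (0.013, 0.1, 0.37, 0.5, 0.8):
    bound = lam / (2 * dist_to_nearest_int(alpha))
    print(alpha, round(abs(T(alpha, N, lam)), 4), round(bound, 4))
```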

We now make two comments and then proceed to prove Lemmas 2 and 3 and the Theorem. 



Comment 1. From the proof of Lemma 2 it will be clear that we could have allowed up to $O(mN^{1/2})$ "exceptions" (i.e., selections of arithmetic progressions not satisfying the stated conditions) and still gotten our result. This almost certainly implies that in our proof of

22 

Lemma 3 we can get a sharper estimate, possibly 0(N ^^ e"^ "^ ). 



Comment 2. The optimal value of $m$ for the theorem is to choose $m$ as large as possible while still consistent with the assumption $m = O(N^{1/2})$. The theorem then, loosely speaking, states that for any small $\delta>0$, with probability $1-O(e^{-\mu(\epsilon)N^{1/2-\delta}})$, where $\mu(\epsilon)$ is some function of $\epsilon>0$, the number of arithmetic progressions of length 3 in a set $S$ generated by the above process will lie in the interval $[((\lambda^3/4) - \epsilon)N^2, ((\lambda^3/4) + \epsilon)N^2]$, as $N\to\infty$.



Proof of Lemma 2. By looking at the continued fraction expansion of $\alpha$, we can find $h$, $q$, $\beta$ such that

$\alpha = (h/q) + \beta$, $(h,q) = 1$, $q\le N^{1/2}$, $q|\beta|<1/N^{1/2}$.

We now start from the relation

$S(\alpha) = (1/mq) \sum_{r=1}^{q} \sum_{n=1}^{N} \sum_{n<x\le n+mq,\ x\in S,\ x\equiv r \pmod q} e(\alpha x) + O(mq);$

this relation is true because, for given $x$, $m$, $q$, there are exactly $mq$ integers $n$ satisfying

$n<x\le n+mq,$

and these integers $n$ also satisfy $1\le n\le N$ provided that $mq<x<N-mq$. Thus the coefficient of $e(\alpha x)$ on the RHS is unity except when $x<mq$ or $x>N-mq$, these cases being compensated for by the error term.

We also have $e(\alpha x) = e(rh/q)e(\beta n) + O(mq|\beta|)$. For each $r$, $n$ the number of terms in the inner sum is at most $(\lambda+\epsilon)m$ by our assumption of $(m,\epsilon)$-equidistribution, thus it is $(\lambda+\epsilon)m - D(m,n,q,r)$, where $D\ge 0$.

Therefore

(a)  $S(\alpha) = (\lambda+\epsilon)(1/q) \sum_{r=1}^{q} e(rh/q) \sum_{n=1}^{N} e(\beta n) - (1/mq) \sum_{r=1}^{q} e(rh/q) \sum_{n=1}^{N} e(\beta n)D(m,n,q,r) + O(mq) + O(Nmq|\beta|).$

If we put $\beta=0$ and $h=0$ (legitimate since we have not yet used $(h,q) = 1$) in the above we get

(b)  $|S| = (\lambda+\epsilon)N - (1/mq) \sum_{r=1}^{q} \sum_{n=1}^{N} D(m,n,q,r) + O(mq).$

Since $S$ is equidistributed we have $(\lambda-\epsilon)N\le|S|\le(\lambda+\epsilon)N$; using this and the facts that $q\le N^{1/2}$, $q|\beta|<1/N^{1/2}$, we can combine (a) and (b) into

$|S(\alpha) - (\lambda/q) \sum_{r=1}^{q} e(rh/q) \sum_{n=1}^{N} e(\beta n)| < 3\epsilon N + O(mN^{1/2}).$

We now distinguish two cases. If $\|\alpha\|<1/N^{1/2}$, then we have $h=0$, $q=1$, $\beta=\alpha$ and the above relation becomes

$|S(\alpha) - T(\alpha)| < 3\epsilon N + O(mN^{1/2})$

which is what we want. If however $\|\alpha\| \ge 1/N^{1/2}$, then we cannot have $q=1$. But in the event $q>1$,

$\sum_{r=1}^{q} e(rh/q) = 0,$

so the above relation becomes $|S(\alpha)|<3\epsilon N + O(mN^{1/2})$. But from Lemma 1, (4) we have

$T(\alpha) = O(\lambda/\|\alpha\|) = O(\lambda N^{1/2}),$

and therefore again

$|S(\alpha)-T(\alpha)| < 3\epsilon N + O(mN^{1/2}).$

Q.E.D. ∎

Proof of Lemma 3. We use the following fundamental property of the binomial distribution:

(*)  $\sum_t C(a,t)p^t q^{a-t} \le \exp[(a-k)\log(aq/(a-k)) + k\log(ap/k)]$

where the sum is either over $t$ such that $t\ge k$, or over $t$ such that $t\le k$, provided that $k\ge pa$ or $k\le pa$ respectively.

For $S$ not to be equidistributed we must have

(a) $|S| < (\lambda-\epsilon)N$ or

(b) there exist $n$, $q$ such that the number of elements of $S$ in the progression $n, n+q, \ldots, n+(m-1)q$ is $>(\lambda+\epsilon)m$.

Using the elementary fact that

$[(1-\lambda\pm\epsilon)\log((1-\lambda)/(1-\lambda\pm\epsilon)) + (\lambda\mp\epsilon)\log(\lambda/(\lambda\mp\epsilon))] < -\epsilon^2$

we see that the probability of (a) happening is $O(e^{-\epsilon^2 N})$ (set $k = (\lambda-\epsilon)N$, $a=N$, $p=\lambda$, $q=1-\lambda$ in (*)), and the probability of (b) happening is certainly less than

$N^{3/2}e^{-\epsilon^2 m}$

(set $k = (\lambda+\epsilon)m$, $a=m$, $p=\lambda$, $q=1-\lambda$ in (*)).

So the probability of either (a) or (b) happening is

$O(N^{3/2}e^{-\epsilon^2 m})$ as $N\to\infty$, $m\to\infty$,

as the lemma asserts. ∎
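
The tail estimate (*) and the elementary fact used with it can be checked numerically for the upper tail. This is a minimal sketch with hypothetical function names and sample parameters ($a=200$, $\lambda=0.3$, $\epsilon=0.1$); it compares the exact binomial tail with the bound in (*) and with the cruder bound $e^{-\epsilon^2 a}$.

```python
import math

def upper_tail(a, p, k):
    """Exact P(X >= k) for X binomial with parameters (a, p)."""
    return sum(math.comb(a, t) * p ** t * (1 - p) ** (a - t) for t in range(k, a + 1))

def tail_bound(a, p, k):
    """Right-hand side of (*): exp[(a-k) log(aq/(a-k)) + k log(ap/k)], with 0 < k < a."""
    q = 1 - p
    return math.exp((a - k) * math.log(a * q / (a - k)) + k * math.log(a * p / k))

a, lam, eps = 200, 0.3, 0.1
k = 80                                  # k = (lam + eps) * a, which exceeds the mean lam * a
exact = upper_tail(a, lam, k)
bound = tail_bound(a, lam, k)
crude = math.exp(-eps ** 2 * a)         # the weaker estimate used in the proof
print(exact, bound, crude)
```

The three printed values increase from left to right, matching the chain of inequalities used in the proof.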

Proof of the Theorem. (For O notation: $N\to\infty$, $\delta\to 0$.) By Lemma 3 we may assume that $S$ is $(m,\epsilon)$-equidistributed. We have

$|S^2(\alpha)S(-2\alpha) - T^2(\alpha)T(-2\alpha)| =$

$|(S^2(\alpha)-T^2(\alpha))S(-2\alpha) + T^2(\alpha)(S(-2\alpha)-T(-2\alpha))| \le$

$|S(-2\alpha)(S(\alpha)+T(\alpha))(S(\alpha)-T(\alpha))| + |T^2(\alpha)(S(-2\alpha)-T(-2\alpha))|$

and by Lemma 2 and Lemma 1, (2) it follows that

(6)  $|S^2(\alpha)S(-2\alpha) - T^2(\alpha)T(-2\alpha)| \le O(\epsilon\lambda^2N^3).$



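From Lemma 2 and Lemma 1, (4), we have

$|S(\alpha)|<3\epsilon N+O(\lambda/\|\alpha\|+mN^{1/2}),$

and thus

$\int_\delta^{1/2}|S(\alpha)|\,d\alpha < 3\epsilon N/2 + O(-\lambda\log\delta + mN^{1/2}),$

which, together with Lemma 1, (3), implies that

(7)  $|\int_\delta^{1-\delta} S^2(\alpha)S(-2\alpha)\,d\alpha| < 3\epsilon\lambda N^2 + O(-\lambda^2N\log\delta + \lambda mN^{3/2}).$

By (6), Lemma 1, (2), we also have

(8)  $\int_{-\delta}^{\delta} S^2(\alpha)S(-2\alpha)\,d\alpha = \int_{-\delta}^{\delta} T^2(\alpha)T(-2\alpha)\,d\alpha + O(\epsilon\lambda^2N^3\delta).$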



Finally, from Lemma 1, (4) and Cauchy-Schwarz we have

iJs J^iam-2a)M^ < (J^ [T(«>|^d«)^ Jg ' |T(-2a)|d« 

(Is 0(X^/fl«{t^)d«)^ 5 0(\^/$^) as 5 -* 0. 



So 



(9) fJ^ T^(a)T{-2a)da( = 0(aV^^\ 5 --* 0. 



We now write 
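$|\int_0^1 S^2(\alpha)S(-2\alpha)\,d\alpha - \int_0^1 T^2(\alpha)T(-2\alpha)\,d\alpha| \le$

$|\int_{-\delta}^{\delta} S^2(\alpha)S(-2\alpha)\,d\alpha - \int_{-\delta}^{\delta} T^2(\alpha)T(-2\alpha)\,d\alpha| +$

$|\int_\delta^{1-\delta} S^2(\alpha)S(-2\alpha)\,d\alpha| + |\int_\delta^{1-\delta} T^2(\alpha)T(-2\alpha)\,d\alpha| \le$

(by (7), (8), (9))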



< SeXN^ + 0(\^NIog5 + XmN^''^ + eX^N^S + X^5"^^^) . 



Setting 6 = N" "^ in the above to get minimum growth of the O term, we obtain the desired 



result: 

If S is (m,e)-equidistributed, then 



ij'o S^(a)S{-2o«)da - Jq T^(«)T(-2a)dai < 3eXN^ + 

0(XmN^^^ + X^(X+c)N^''^). 
I. 

The arguments in this appendix were motivated by the discussion of [Roth].